Preface to the First Edition


In the past half-century the theory of probability has grown from a minor 
isolated theme into a broad and intensive discipline interacting with many 
other branches of mathematics. At the same time it is playing a central role 
in the mathematization of various applied sciences such as statistics, operations
research, biology, economics and psychology, to name a few to which
the prefix “mathematical” has so far been firmly attached. The coming-of-age
of probability has been reflected in the change of contents of textbooks on 
the subject. In the old days most of these books showed a visible split
personality torn between the combinatorial games of chance and the so-called
“theory of errors” centering in the normal distribution. This period ended
with the appearance of Feller’s classic treatise (see [Feller 1]†) in 1950, from
the manuscript of which I gave my first substantial course in probability.
With the passage of time probability theory and its applications have won a 
place in the college curriculum as a mathematical discipline essential to many 
fields of study. The elements of the theory are now given at different levels, 
sometimes even before calculus. The present textbook is intended for a course 
at about the sophomore level. It presupposes no prior acquaintance with the 
subject and the first three chapters can be read largely without the benefit of 
calculus. The next three chapters require a working knowledge of infinite 
series and related topics, and for the discussion involving random variables 
with densities some calculus is of course assumed. These parts dealing with 
the “continuous case” as distinguished from the “discrete case” are easily
separated and may be postponed. The contents of the first six chapters should 
form the backbone of any meaningful first introduction to probability theory. 
Thereafter a reasonable selection includes: §7.1 (Poisson distribution, which 
may be inserted earlier in the course), some kind of going over of §7.3, 7.4, 7.6 
(normal distribution and the law of large numbers), and §8.1 (simple random 
walks which are both stimulating and useful). All this can be covered in a 
semester but for a quarter system some abridgment will be necessary. Specifically,
for such a short course Chapters 1 and 3 may be skimmed through and
the asterisked material omitted. In any case a solid treatment of the normal 
approximation theorem in Chapter 7 should be attempted only if time is 
available as in a semester or two-quarter course. The final Chapter 8 gives a 
self-contained elementary account of Markov chains and is an extension of 
the main course at a somewhat more mature level. Together with the asterisked
sections 5.3, 5.4 (sequential sampling and Pólya urn scheme) and 7.2
(Poisson process), and perhaps some filling in from the Appendices, the
material provides a gradual and concrete passage into the domain of stochastic
processes. With these topics included the book will be suitable for a
two-quarter course, of the kind that I have repeatedly given to students of
mathematical sciences and engineering. However, after the preparation of
the first six chapters the reader may proceed to more specialized topics treated
e.g. in the above-mentioned treatise by Feller. If the reader has the adequate
mathematical background, he will also be prepared to take a formal rigorous
course such as presented in my own more advanced book [Chung 1].

† Names in square brackets refer to the list of General References on p. 307. William Feller
(1906-1970).

Much thought has gone into the selection, organization and presentation 
of the material to adapt it to classroom uses, but I have not tried to offer a 
slick package to fit in with an exact schedule or program such as popularly 
demanded at the quick-service counters. A certain amount of flexibility and 
choice is left to the instructor who can best judge what is right for his class. 
Each chapter contains some easy reading at the beginning, for motivation 
and illustration, so that the instructor may concentrate on the more formal 
aspects of the text. Each chapter also contains some slightly more challenging 
topics (e.g., §1.4, 2.5) for optional sampling. They are not meant to deter the 
beginner but to serve as an invitation to further study. The prevailing emphasis
is on the thorough and deliberate discussion of the basic concepts and
techniques of elementary probability theory with few frills and minimal technical
complications. Many examples are chosen to anticipate the beginners’
difficulties and to provoke better thinking. Often this is done by posing and 
answering some leading questions. Historical, philosophical and personal 
comments are inserted to add flavor to this lively subject. It is my hope that 
the reader will not only learn something from the book but may also derive 
a measure of enjoyment in so doing. 

There are over two hundred exercises for the first six chapters and some 
eighty more for the last two. Many are easy, the harder ones indicated by 
asterisks, and all answers gathered at the end of the book. Asterisked sections 
and paragraphs deal with more special or elaborate material and may be 
skipped, but a little browsing in them is recommended. 

The author of any elementary textbook owes of course a large debt to 
innumerable predecessors. More personal indebtedness is acknowledged below.
Michel Nadzela wrote up a set of notes for a course I gave at Stanford
in 1970. Gian-Carlo Rota, upon seeing these notes, gave me an early impetus 
toward transforming them into a book. D. G. Kendall commented on the 
first draft of several chapters and lent further moral support. J. L. Doob 
volunteered to read through most of the manuscript and offered many helpful 
suggestions. K. B. Erickson used some of the material in a course he taught. 
A. A. Balkema checked the almost final version and made numerous improvements.
Dan Rudolph read the proofs together with me. Perfecto Mary drew
those delightful pictures. Gail Lemmond did the typing with her usual efficiency
and dependability. Finally, it is a pleasure to thank my old publisher
Springer-Verlag for taking my new book to begin a new series of undergraduate
texts.

K. L. C. 


March 1974. 


Preface to the Second Edition 


A determined effort was made to correct the errors in the first edition. 
This task was assisted by: Chao Hung-po, J. L. Doob, R. M. Exner, W. H. 
Fleming, A. M. Gleason, Karen Kafador, S. H. Polit, and P. van Moerbeke. 
Miss Kafador and Dr. Polit compiled particularly careful lists of suggestions. 
The most distressing errors were in the Solutions to Problems. All of them 
have now been checked by myself from Chapter 1 to 5, and by Mr. Chao
from Chapter 6 to 8. It is my fervent hope that few remnant mistakes remain 
in that sector. A few small improvements and additions were also made, but 
not all advice can be heeded at this juncture. Users of the book are implored 
to send in any criticism and commentary, to be taken into consideration in a 
future edition. Thanks are due to the staff of Springer-Verlag for making this 
revision possible so soon after the publication of the book. 


K.L.C. 
Chapter 1


Set 


1.1. Sample sets 


These days school children are taught about sets. A second grader* was 
asked to name “the set of girls in his class.” This can be done by a complete 
list such as: 


“Nancy, Florence, Sally, Judy, Ann, Barbara, .. .” 


A problem arises when there are duplicates. To distinguish between two 
Barbaras one must indicate their family names or call them $B_1$ and $B_2$. The
same member cannot be counted twice in a set. 

The notion of a set is common in all mathematics. For instance in geometry
one talks about “the set of points which are equidistant from a given
point.” This is called a circle. In algebra one talks about “the set of integers
which have no other divisors except 1 and itself.” This is called the set of
prime numbers. In calculus the domain of definition of a function is a set of 
numbers, e.g., the interval (a, b); so is the range of a function if you remember 
what it means. 

In probability theory the notion of a set plays a more fundamental role. 
Furthermore we are interested in very general kinds of sets as well as specific 
concrete ones. To begin with the latter kind, consider the following examples: 


(a) a bushel of apples; 

(b) fifty-five cancer patients under a certain medical treatment;
(c) all the students in a college; 

(d) all the oxygen molecules in a given container; 

(e) all possible outcomes when six dice are rolled; 

(f) all points on a target board. 


Let us consider at the same time the following “smaller” sets:


(a’) the rotten apples in that bushel; 

(b’) those patients who respond positively to the treatment; 

(c’) the mathematics majors of that college; 

(d’) those molecules which are traveling upwards; 

(e’) those cases when the six dice show different faces; 

(f’) the points in a little area called the “bull’s eye” on the board. 


* My son Daniel. 
We shall set up a mathematical model for these and many more such
examples that may come to mind, namely we shall abstract and generalize
our intuitive notion of “a bunch of things.” First we call the things points,
then we call the bunch a space; we prefix them by the word “sample” to
distinguish these terms from other usages, and also to allude to their statistical 
origin. Thus a sample point is the abstraction of an apple, a cancer patient, 
a student, a molecule, a possible chance outcome, or an ordinary geometrical 
point. The sample space consists of a number of sample points, and is just a 
name for the totality or aggregate of them all. Any one of the examples (a)-(f) 
above can be taken to be a sample space, but so also may any one of the 
smaller sets in (a’)-(f’). What we choose to call a space [a universe] is a relative 
matter. 

Let us then fix a sample space to be denoted by $\Omega$, the capital Greek letter
omega. It may contain any number of points, possibly infinite but at least one.
(As you have probably found out before, mathematics can be very pedantic!)
Any of these points may be denoted by $\omega$, the small Greek letter omega, to
be distinguished from one another by various devices such as adding subscripts
or dashes (as in the case of the two Barbaras if we do not know their
family names), thus $\omega_1$, $\omega_2$, $\omega'$, .... Any partial collection of the points is a
subset of $\Omega$, and since we have fixed $\Omega$ we will just call it a set. In extreme cases
a set may be $\Omega$ itself or the empty set which has no point in it. You may be
surprised to hear that the empty set is an important entity and is given a
special symbol $\emptyset$. The number of points in a set $S$ will be called its size and
denoted by $|S|$, thus it is a nonnegative integer or $\infty$. In particular $|\emptyset| = 0$.

A particular set S is well defined if it is possible to tell whether any given 
point belongs to it or not. These two cases are denoted respectively by 


$\omega \in S$;  $\omega \notin S$.


Thus a set is determined by a specified rule of membership. For instance, the 
sets in (a’)-(f’) are well defined up to the limitations of verbal descriptions. 
One can always quibble about the meaning of words such as “a rotten apple,” 
or attempt to be funny by observing, for instance, that when dice are rolled 
on a pavement some of them may disappear into the sewer. Some people of 
a pseudo-philosophical turn of mind get a lot of mileage out of such caveats, 
but we will not indulge in them here. Now, one sure way of specifying a rule 
to determine a set is to enumerate all its members, namely to make a complete 
list as the second grader did. But this may be tedious if not impossible. For 
example, it will be shown in §3.1 that the size of the set in (e) is equal to
$6^6 = 46656$. Can you give a quick guess how many pages of a book like this
will be needed just to record all these possibilities of a mere throw of six dice? 
On the other hand it can be described in a systematic and unmistakable way 
as the set of all ordered 6-tuples of the form below: 


$(s_1, s_2, s_3, s_4, s_5, s_6)$

where each of the symbols $s_j$, $1 \le j \le 6$, may be any of the numbers 1, 2, 3,
4, 5, 6. This is a good illustration of mathematics being economy of thought 
(and printing space). 
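
The systematic description can also be handed to a machine. The following is only a small sketch in Python (the names `faces`, `outcomes` and `all_different` are illustrative and not from the text): it builds every ordered 6-tuple and counts them, and also counts the smaller set (e’) in which the six dice show different faces.

```python
from itertools import product

faces = range(1, 7)

# Every ordered 6-tuple (s1, ..., s6) with each entry one of 1, ..., 6.
outcomes = list(product(faces, repeat=6))
print(len(outcomes))        # size of the set in (e)

# The smaller set (e'): the six dice show six different faces.
all_different = [s for s in outcomes if len(set(s)) == 6]
print(len(all_different))   # size of the set in (e')
```

The rule of membership ("each entry is one of 1 to 6") fits on one line, while the complete list has tens of thousands of entries.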

If every point of A belongs to B, then A is contained or included in B and 
is a subset of B, while B is a superset of A. We write this in one of the two 
ways below: 


$A \subset B$,  $B \supset A$.

Two sets are identical if they contain exactly the same points, and then we
write

$A = B$.

Another way to say this is: $A = B$ if and only if $A \subset B$ and $B \subset A$. This may
sound unnecessarily roundabout to you, but is often the only way to check
that two given sets are really identical. It is not always easy to identify
two sets defined in different ways. Do you know for example that the set
of even integers is identical with the set of all solutions $x$ of the equation
$\sin(\pi x/2) = 0$? We shall soon give some examples of showing the identity
of sets by the roundabout method. 
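
As a hedged numerical illustration of the two-inclusion test (not a proof), one can compare the two descriptions on a finite window. The window of candidates (multiples of 1/2 between -20 and 20) and the tolerance `1e-9` are assumptions made only for this sketch.

```python
import math

# Finite window of candidate values: multiples of 1/2 from -20 to 20.
candidates = [k / 2 for k in range(-40, 41)]

# First description: the even integers in the window.
evens = {x for x in range(-20, 21) if x % 2 == 0}

# Second description: candidates solving sin(pi*x/2) = 0, up to rounding error.
solutions = {x for x in candidates if abs(math.sin(math.pi * x / 2)) < 1e-9}

# Identity of sets checked the roundabout way: inclusion in both directions.
print(evens <= solutions, solutions <= evens)
```

Both inclusions hold on this window; the general statement still needs the argument that the sine vanishes exactly at the integer multiples of $\pi$.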


1.2. Operations with sets


We learn about sets by operating on them, just as we learn about numbers by 
operating on them. In the latter case we say also that we compute with 
numbers: add, subtract, multiply, and so on. These operations performed 
on given numbers produce other numbers which are called their sum, difference,
product, etc. In the same way, operations performed on sets produce
other sets with new names. We are now going to discuss some of these and 
the laws governing them. 


Complement. The complement of a set $A$ is denoted by $A^c$ and is the set of
points which do not belong to $A$. Remember we are talking only about points
in a fixed $\Omega$! We write this symbolically as follows:

$A^c = \{\omega \mid \omega \notin A\}$

which reads: “$A^c$ is the set of $\omega$ which does not belong to $A$.” In particular
$\Omega^c = \emptyset$ and $\emptyset^c = \Omega$. The operation has the property that if it is performed
twice in succession on $A$, we get $A$ back:

(1.2.1) $(A^c)^c = A$.


Union. The union $A \cup B$ of two sets $A$ and $B$ is the set of points which belong
to at least one of them. In symbols:

$A \cup B = \{\omega \mid \omega \in A \text{ or } \omega \in B\}$

where “or” means “and/or” in pedantic [legal] style, and will always be used
in this sense.

Figure 1

Intersection. The intersection $A \cap B$ of two sets $A$ and $B$ is the set of points
which belong to both of them. In symbols:

$A \cap B = \{\omega \mid \omega \in A \text{ and } \omega \in B\}$.


We hold the truth of the following laws as self-evident:

Commutative Law. $A \cup B = B \cup A$, $A \cap B = B \cap A$.
Associative Law. $(A \cup B) \cup C = A \cup (B \cup C)$,
$(A \cap B) \cap C = A \cap (B \cap C)$.


But observe that these relations are instances of identity of sets mentioned 
above, and are subject to proof. They should be compared, but not confused, 
with analogous laws for sum and product of numbers: 


$a + b = b + a$, $a \times b = b \times a$;
$(a + b) + c = a + (b + c)$, $(a \times b) \times c = a \times (b \times c)$.


Brackets are needed to indicate the order in which the operations are to be 
performed. Because of the associative laws, however, we can write 


$A \cup B \cup C$,  $A \cap B \cap C \cap D$,

without brackets. But a string of symbols like $A \cup B \cap C$ is ambiguous,
therefore not defined; indeed $(A \cup B) \cap C$ is not identical with $A \cup (B \cap C)$.
You should be able to settle this easily by a picture.
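
Besides a picture, a single concrete instance settles it. The three small sets below are chosen arbitrarily for illustration; Python's `|` and `&` stand for union and intersection.

```python
A = {1}
B = {2}
C = {2, 3}

left = (A | B) & C    # union first, then intersection
right = A | (B & C)   # intersection first, then union

print(left, right, left == right)
```

Here the point 1 belongs to the second set but not to the first, so the two bracketings really differ.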
Figure 2: $(A \cup B) \cap C$ and $A \cup (B \cap C)$


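The next pair of distributive laws connect the two operations as follows:

(D1) $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$;
(D2) $(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$.

Figure 3: $(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$ and $(A \cap B) \cup C = (A \cup C) \cap (B \cup C)$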


Several remarks are in order. First, the analogy with arithmetic carries over
to (D1):

$(a + b) \times c = (a \times c) + (b \times c)$;

but breaks down in (D2):

$(a \times b) + c \neq (a + c) \times (b + c)$.

Of course the alert reader will have observed that the analogy breaks down
already at an earlier stage, for

$A = A \cup A = A \cap A$;

but the only number $a$ satisfying the relation $a + a = a$ is 0; while there are
exactly two numbers satisfying $a \times a = a$, namely 0 and 1.

Second, you have probably already discovered the use of diagrams to 
prove or disprove assertions about sets. It is also a good practice to see the
truth of such formulas as (D1) and (D2) by well-chosen examples. Suppose
then


A = inexpensive things, B = really good things, 
C = food [edible things]. 


Then $(A \cup B) \cap C$ means “(inexpensive or really good) food,” while
$(A \cap C) \cup (B \cap C)$ means “(inexpensive food) or (really good food).” So
they are the same thing all right. This does not amount to a proof, as one
swallow does not make a summer, but if one is convinced that whatever
logical structure or thinking process involved above in no way depends on
the precise nature of the three things $A$, $B$ and $C$, so much so that they can
be anything, then one has in fact landed a general proof. Now it is interesting
that the same example applied to (D2) somehow does not make it equally
obvious (at least to the author). Why? Perhaps because some patterns of logic
are in more common use in our everyday experience than others.

This last remark becomes more significant if one notices an obvious 
duality between the two distributive laws. Each can be obtained from the 
other by switching the two symbols $\cup$ and $\cap$. Indeed each can be deduced
from the other by making use of this duality (Exercise 11). 

Finally, since (D2) comes less naturally to the intuitive mind, we will avail 
ourselves of this opportunity to demonstrate the roundabout method of 
identifying sets mentioned above by giving a rigorous proof of the formula. 
According to this method, we must show: (i) each point on the left side of
(D2) belongs to the right side; (ii) each point on the right side of (D2) belongs
to the left side.


(i) Suppose $\omega$ belongs to the left side of (D2), then it belongs either to $A \cap B$
or to $C$. If $\omega \in A \cap B$, then $\omega \in A$, hence $\omega \in A \cup C$; similarly $\omega \in B \cup C$.
Therefore $\omega$ belongs to the right side of (D2). On the other hand if $\omega \in C$,
then $\omega \in A \cup C$ and $\omega \in B \cup C$ and we finish as before.

(ii) Suppose $\omega$ belongs to the right side of (D2), then $\omega$ may or may not belong
to $C$, and the trick is to consider these two alternatives. If $\omega \in C$, then it
certainly belongs to the left side of (D2). On the other hand, if $\omega \notin C$, then
since it belongs to $A \cup C$, it must belong to $A$; similarly it must belong to $B$.
Hence it belongs to $A \cap B$, and so to the left side of (D2). Q.E.D.
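
For a small sample space the two distributive laws can also be confirmed by brute force. The sketch below assumes a four-point space (the list `omega` is an arbitrary choice) and tries every triple of subsets; it complements the proof above but does not replace it, since it says nothing about larger spaces.

```python
from itertools import combinations

omega = [1, 2, 3, 4]

# All 2**4 = 16 subsets of omega.
subsets = [set(c) for r in range(len(omega) + 1) for c in combinations(omega, r)]

# First distributive law: (A u B) n C = (A n C) u (B n C).
d1_holds = all((A | B) & C == (A & C) | (B & C)
               for A in subsets for B in subsets for C in subsets)

# Second distributive law: (A n B) u C = (A u C) n (B u C).
d2_holds = all((A & B) | C == (A | C) & (B | C)
               for A in subsets for B in subsets for C in subsets)

print(len(subsets), d1_holds, d2_holds)
```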


1.3. Various relations 


The three operations so far defined: complement, union and intersection obey 
two more laws called De Morgan’s laws: 

(C1) $(A \cup B)^c = A^c \cap B^c$;

(C2) $(A \cap B)^c = A^c \cup B^c$.

Figure 4: $(A \cap B)^c = A^c \cup B^c$

They are dual in the same sense as (D1) and (D2) are. Let us check these by
our previous example. If $A$ = inexpensive, and $B$ = really good, then clearly
$(A \cup B)^c$ = not inexpensive nor really good, namely high-priced junk, which
is the same as $A^c \cap B^c$ = expensive and not really good. Similarly we can
check (C2).

Logically, we can deduce either (C1) or (C2) from the other; let us show
it one way. Suppose then (C1) is true, then since $A$ and $B$ are arbitrary sets
we can substitute their complements and get

(1.3.1) $(A^c \cup B^c)^c = (A^c)^c \cap (B^c)^c = A \cap B$

where we have also used (1.2.1) for the second equation. Now taking the
complements of the first and third sets in (1.3.1) and using (1.2.1) again
we get

$A^c \cup B^c = (A \cap B)^c$.

This is (C2). Q.E.D.
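
The same kind of exhaustive check works for De Morgan's laws. In this sketch the complement is always taken inside a fixed four-point space; `omega` and the helper `complement` are illustrative names, not notation from the text.

```python
from itertools import combinations

omega = {1, 2, 3, 4}
subsets = [set(c) for r in range(len(omega) + 1)
           for c in combinations(sorted(omega), r)]

def complement(S):
    """Complement of S relative to the fixed sample space omega."""
    return omega - S

# Complement of a union is the intersection of the complements.
c1_holds = all(complement(A | B) == complement(A) & complement(B)
               for A in subsets for B in subsets)

# Complement of an intersection is the union of the complements.
c2_holds = all(complement(A & B) == complement(A) | complement(B)
               for A in subsets for B in subsets)

# Complementing twice gives the set back, as used in the deduction above.
double_ok = all(complement(complement(A)) == A for A in subsets)

print(c1_holds, c2_holds, double_ok)
```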

It follows from De Morgan’s laws that if we have complementation,
then either union or intersection can be expressed in terms of the other. Thus
we have

$A \cap B = (A^c \cup B^c)^c$,
$A \cup B = (A^c \cap B^c)^c$;

and so there is redundancy among the three operations. On the other hand
it is impossible to express complementation by means of the other two, although
there is a magic symbol from which all three can be derived (Exercise
14). It is convenient to define some other operations, as we now do.


Difference. The set $A \setminus B$ is the set of points which belong to $A$ and (but) not
to $B$. In symbols:

$A \setminus B = A \cap B^c = \{\omega \mid \omega \in A \text{ and } \omega \notin B\}$.

Figure 5


This operation is neither commutative nor associative. Let us find a counterexample
to the associative law, namely, to find some $A$, $B$, $C$ for which

(1.3.2) $(A \setminus B) \setminus C \neq A \setminus (B \setminus C)$.

Note that in contrast to a proof of identity discussed above, a single instance
of falsehood will destroy the identity. In looking for a counter-example one
usually begins by specializing the situation to reduce the “unknowns.” So
try $B = C$. The left side of (1.3.2) becomes $A \setminus B$, while the right side becomes
$A \setminus \emptyset = A$. Thus we need only make $A \setminus B \neq A$, and that is easy.
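
One such choice, written out as a short sketch (the particular sets are arbitrary; any $A$ sharing a point with $B$ will do), uses Python's `-` for set difference:

```python
A = {1, 2}
B = {2}
C = B                  # the specialization B = C suggested above

left = (A - B) - C     # reduces to A minus B
right = A - (B - C)    # B minus C is empty, so this is A itself

print(left, right, left != right)
```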

In case $A \supset B$ we write $A - B$ for $A \setminus B$. Using this new symbol we have

$A \setminus B = A - (A \cap B)$;

and

$A^c = \Omega - A$.

The operation “$-$” has some resemblance to the arithmetic operation of
subtracting, in particular $A - A = \emptyset$, but the analogy does not go very far.
For instance, there is no analogue to $(a + b) - c = a + (b - c)$.


Symmetric Difference. The set $A \triangle B$ is the set of points which belong to
exactly one of the two sets $A$ and $B$. In symbols:

$A \triangle B = (A \cap B^c) \cup (A^c \cap B) = (A \setminus B) \cup (B \setminus A)$.

Figure 6: $(A \cup B) \setminus C$ and $A \cup (B \setminus C)$


This operation is useful in advanced theory of sets. As its name indicates, it 
is symmetric with respect to A and B, which is the same as saying that it is 
commutative. Is it associative? Try some concrete examples or diagrams 
which have succeeded so well before, and you will probably be as quickly confused
as I am. But the question can be neatly resolved by a device to be
introduced in §1.4. 

Having defined these operations, we should let our fancy run free for a 
few moments and imagine all kinds of sets that can be obtained by using 
them in succession in various combinations and permutations, such as 


$[(A \setminus C) \cap (B \cup C)]^c \cup (A^c \triangle B)$.


But remember we are talking about subsets of a fixed $\Omega$, and if $\Omega$ is a finite
set the number of distinct subsets is certainly also finite, so there must be a
tremendous amount of interrelationship among these sets that we can build
up. The various laws discussed above are just some of the most basic ones,
and a few more will be given among the exercises below.

An extremely important relation between sets will now be defined. Two 
sets A and B are said to be disjoint when they do not intersect, namely, have 
no point in common: 


$A \cap B = \emptyset$.

This is equivalent to either one of the following inclusion conditions:

$A \subset B^c$;  $B \subset A^c$.

Any number of sets are said to be disjoint when every pair of them are disjoint
as just defined. Thus, “$A$, $B$, $C$ are disjoint” means more than just
$A \cap B \cap C = \emptyset$; it means

$A \cap B = \emptyset$, $A \cap C = \emptyset$, $B \cap C = \emptyset$.

Figure 7: panels “$A$, $B$, $C$ disjoint” and “$A \cap B \cap C = \emptyset$”

From here on we will omit the intersection symbol and write simply

$AB$ for $A \cap B$

just as we write $ab$ for $a \times b$. When $A$ and $B$ are disjoint we will write
sometimes

$A + B$ for $A \cup B$.


But be careful: not only does “+” mean addition for numbers but even 
when A and B are sets there are other usages of A + B such as their vectorial 
sum. 

For any set A, we have the obvious decomposition: 


(1.3.3) $\Omega = A + A^c$.

The way to think of this is: the set $A$ gives a classification of all points $\omega$ in $\Omega$
according as $\omega$ belongs to $A$ or to $A^c$. A college student may be classified
according as he is a mathematics major or not, but he can also be classified
according as he is a freshman or not, of voting age or not, has a car or
not, ..., is a girl or not. Each two-way classification divides the sample
space into two disjoint sets, and if several of these are superimposed on each
other we get, e.g.,

(1.3.4) $\Omega = (A + A^c)(B + B^c) = AB + AB^c + A^cB + A^cB^c$,

(1.3.5) $\Omega = (A + A^c)(B + B^c)(C + C^c) = ABC + ABC^c + AB^cC + AB^cC^c + A^cBC + A^cBC^c + A^cB^cC + A^cB^cC^c$.

Figure 8: $A$ and $A^c$


Let us call the pieces of such a decomposition the atoms. There are 2, 4, 8 atoms
respectively above according as 1, 2, 3 sets are considered. In general there
will be $2^n$ atoms if $n$ sets are considered. Now these atoms have a remarkable
property which will be illustrated in the case (1.3.5), as follows: no matter
how you operate on the three sets $A$, $B$, $C$, and no matter how many times
you do it, the resulting set can always be written as the union of some of the
atoms. Here are some examples:

$A \cup B = ABC + ABC^c + AB^cC + AB^cC^c + A^cBC + A^cBC^c$
$(A \setminus B) \setminus C^c = AB^cC$
$(A \triangle B)C^c = AB^cC^c + A^cBC^c$.

Can you see why?
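
A small experiment may help with the question. The sketch below assumes a twelve-point sample space and three arbitrarily chosen sets; the helper `atoms` and all names are illustrative. It builds the eight atoms by choosing, for each set, either the set or its complement, and then rebuilds the union of the first two sets from atoms alone.

```python
from itertools import product

omega = set(range(12))
A = {n for n in omega if n % 2 == 0}
B = {n for n in omega if n % 3 == 0}
C = {n for n in omega if n < 6}

def atoms(sets, universe):
    """Map each pattern of in/out flags to the corresponding atom."""
    result = {}
    for flags in product([True, False], repeat=len(sets)):
        atom = set(universe)
        for s, inside in zip(sets, flags):
            # Intersect with the set itself or with its complement.
            atom &= s if inside else (universe - s)
        result[flags] = atom
    return result

pieces = atoms([A, B, C], omega)

# The atoms are disjoint and together exhaust omega.
total_size = sum(len(atom) for atom in pieces.values())

# A union B is the sum of those atoms lying inside A or inside B.
union_from_atoms = set().union(
    *(atom for flags, atom in pieces.items() if flags[0] or flags[1]))

print(len(pieces), total_size, union_from_atoms == A | B)
```

Each point of the space sits in exactly one atom, and membership in any set built from the three is decided by the atom's pattern of flags alone; that is the reason behind the property.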
Up to now we have considered only the union or intersection of a finite
number of sets. There is no difficulty in extending this to an infinite number
of sets. Suppose a finite or infinite sequence of sets $A_n$, $n = 1, 2, \ldots$, is
given, then we can form their union and intersection as follows:

$\bigcup_n A_n = \{\omega \mid \omega \in A_n \text{ for at least one value of } n\}$,

$\bigcap_n A_n = \{\omega \mid \omega \in A_n \text{ for all values of } n\}$.
When the sequence is infinite these may be regarded as obvious “set limits”
of finite unions or intersections, thus:

$\bigcup_{n=1}^{\infty} A_n = \lim_{m \to \infty} \bigcup_{n=1}^{m} A_n$;  $\bigcap_{n=1}^{\infty} A_n = \lim_{m \to \infty} \bigcap_{n=1}^{m} A_n$.

Observe that as $m$ increases, $\bigcup_{n=1}^{m} A_n$ does not decrease while $\bigcap_{n=1}^{m} A_n$ does not
increase, and we may say that the former swells up to $\bigcup_{n=1}^{\infty} A_n$, the latter shrinks
down to $\bigcap_{n=1}^{\infty} A_n$.

The distributive laws and De Morgan’s laws have obvious extensions to 
a finite or infinite sequence of sets. For instance 


$\left(\bigcup_n A_n\right) \cap B = \bigcup_n (A_n \cap B)$;
$\left(\bigcap_n A_n\right)^c = \bigcup_n A_n^c$.


Really interesting new sets are produced by using both union and inter- 
section an infinite number of times, and in succession. Here are the two most 
prominent ones: 


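$\bigcap_{m=1}^{\infty} \left( \bigcup_{n=m}^{\infty} A_n \right)$;  $\bigcup_{m=1}^{\infty} \left( \bigcap_{n=m}^{\infty} A_n \right)$.

These belong to a more advanced course (see [Chung 1; §4.2] of the References).
They are shown here as a preview to arouse your curiosity.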


1.4.* Indicator 


The idea of classifying w by means of a dichotomy: to be or not to be in A, 
which we discussed toward the end of §1.3, can be quantified into a useful 
device. This device will generalize to the fundamental notion of “random 
variable” in Chapter 4. 

Imagine $\Omega$ to be a target board and $A$ a certain marked area on the board
as in Examples (f) and (f’) above. Imagine that “pick a point $\omega$ in $\Omega$” is done
by shooting a dart at the target. Suppose a bell rings (or a bulb lights up)
when the dart hits within the area $A$; otherwise it is a dud. This is the intuitive
picture expressed below by a mathematical formula:

$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{if } \omega \notin A. \end{cases}$

* This section may be omitted after the first three paragraphs.

Figure 9


Thus the symbol $I_A$ is a function which is defined on the whole sample space
$\Omega$ and takes only the two values 0 and 1, corresponding to a dud and a ring.
You may have learned in a calculus course the importance of distinguishing
between a function (sometimes called a mapping) and one of its values. Here
it is the function $I_A$ that indicates the set $A$, hence it is called the indicator
function, or briefly, indicator of $A$. Another set $B$ has its indicator $I_B$. The
two functions $I_A$ and $I_B$ are identical (what does that mean?) if and only if
the two sets are identical.

To see how we can put indicators to work, let us figure out the indicators
for some of the sets discussed before. We need two mathematical symbols
$\vee$ (cup) and $\wedge$ (cap) which may be new to you. For any two real numbers $a$
and $b$, they are defined as follows:

(1.4.1) $a \vee b$ = maximum of $a$ and $b$,
        $a \wedge b$ = minimum of $a$ and $b$.

In case $a = b$, either one of them will serve as maximum as well as minimum.
Now the salient properties of indicators are given by the formulas below:

(1.4.2) $I_{A \cap B}(\omega) = I_A(\omega) \wedge I_B(\omega) = I_A(\omega) \cdot I_B(\omega)$;
(1.4.3) $I_{A \cup B}(\omega) = I_A(\omega) \vee I_B(\omega)$.

You should have no difficulty checking these equations; after all there are
only two possible values 0 and 1 for each of these functions. Since the equations
are true for every $\omega$, they can be written more simply as equations
(identities) between functions:

(1.4.4) $I_{A \cap B} = I_A \wedge I_B = I_A \cdot I_B$,
(1.4.5) $I_{A \cup B} = I_A \vee I_B$.

Here for example the function $I_A \wedge I_B$ is that mapping which assigns to each
$\omega$ the value $I_A(\omega) \wedge I_B(\omega)$, just as in calculus the function $f + g$ is that mapping
which assigns to each $x$ the number $f(x) + g(x)$.
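
The distinction between a function and its values shows up naturally in code. In the following sketch (the ten-point space and the two overlapping sets are arbitrary, and `indicator` is a hypothetical helper name) an indicator is a function built from a set, and the formulas for intersection and union are checked point by point, with `min` and `max` playing the roles of cap and cup.

```python
omega = range(10)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

def indicator(S):
    """Return the indicator function of the set S."""
    return lambda w: 1 if w in S else 0

I_A, I_B = indicator(A), indicator(B)
I_cap, I_cup = indicator(A & B), indicator(A | B)

# Indicator of the intersection: the minimum, which is also the product.
cap_ok = all(I_cap(w) == min(I_A(w), I_B(w)) == I_A(w) * I_B(w) for w in omega)

# Indicator of the union: the maximum.
cup_ok = all(I_cup(w) == max(I_A(w), I_B(w)) for w in omega)

# The plain sum can reach 2 on the overlap, so it is not an indicator.
largest_sum = max(I_A(w) + I_B(w) for w in omega)

print(cap_ok, cup_ok, largest_sum)
```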

After observing the product $I_A(\omega) \cdot I_B(\omega)$ at the end of (1.4.2) you may be
wondering why we do not have the sum $I_A(\omega) + I_B(\omega)$ in (1.4.3). But if this
were so we could get the value 2 there, which is impossible since the first
member $I_{A \cup B}(\omega)$ cannot take this value. Nevertheless, shouldn’t $I_A + I_B$ mean
something? Consider target shooting again but this time mark out two overlapping
areas $A$ and $B$. Instead of bell-ringing, you get 1 penny if you hit
within $A$, and also if you hit within $B$. What happens if you hit the intersection
$AB$? That depends on the rule of the game. Perhaps you still get 1 penny,
perhaps you get 2 pennies. Both rules are legitimate. In formula (1.4.3) it is
the first rule that applies. If you want to apply the second rule, then you are
no longer dealing with the set $A \cup B$ alone as in Figure 10a, but something
like Figure 10b:

Figure 10a ($A \cup B$)    Figure 10b


This situation can be realized electrically by laying first a uniform charge over
the area $A$, and then on top of this, another charge over the area $B$, so that
the resulting total charge is distributed as shown in Figure 10b. In this case
the variable charge will be represented by the function $I_A + I_B$. Such a sum
of indicators is a very special case of sum of random variables which will
occupy us in later chapters.

For the present let us return to formula (1.4.5) and note that if the two
sets $A$ and $B$ are disjoint, then it indeed reduces to the sum of the indicators,
because then at most one of the two indicators can take the value 1, so that
the maximum coincides with the sum, namely

$0 \vee 0 = 0 + 0$, $0 \vee 1 = 0 + 1$, $1 \vee 0 = 1 + 0$.

Thus we have

(1.4.6) $I_{A+B} = I_A + I_B$ provided $A \cap B = \emptyset$.

As a particular case, we have for any set $A$:

$I_\Omega = I_A + I_{A^c}$.

Now $I_\Omega$ is the constant function 1 (on $\Omega$), hence we may rewrite the above as

(1.4.7) $I_{A^c} = 1 - I_A$.


We can now derive an interesting formula. Since $(A \cup B)^c = A^cB^c$, we get by
applying (1.4.7), (1.4.4) and then (1.4.7) again:

$I_{A \cup B} = 1 - I_{A^cB^c} = 1 - I_{A^c} I_{B^c} = 1 - (1 - I_A)(1 - I_B)$.

Multiplying out the product (we are dealing with numerical functions!) and
transposing terms we obtain

(1.4.8) $I_{A \cup B} + I_{A \cap B} = I_A + I_B$.
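
For readers who want the multiplication written out, the intermediate steps may be set down as follows; this only expands the product already displayed and uses (1.4.4) in the last line.

```latex
\begin{align*}
I_{A \cup B} &= 1 - (1 - I_A)(1 - I_B) \\
             &= 1 - \bigl(1 - I_A - I_B + I_A I_B\bigr) \\
             &= I_A + I_B - I_A I_B \\
             &= I_A + I_B - I_{A \cap B}.
\end{align*}
```

Transposing the last term to the left side gives (1.4.8).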


Finally we want to investigate $I_{A \triangle B}$. We need a bit of arithmetic (also
called number theory) first. All integers can be classified as even or odd,
according as the remainder we get when we divide it by 2 is 0 or 1. Thus each
integer may be identified with (or reduced to) 0 or 1, provided we are only
interested in its parity and not its exact value. When integers are added or
subtracted subject to this reduction, we say we are operating modulo 2. For
instance:

$5 + 7 + 8 - 1 + 3 = 1 + 1 + 0 - 1 + 1 = 2 = 0$, modulo 2.

A famous case of this method of counting occurs when the maiden picks off
the petals of some wild flower one by one and murmurs: “he loves me,” “he
loves me not” in turn. Now you should be able to verify the following equation
for every $\omega$:

(1.4.9) $I_{A \triangle B}(\omega) = I_A(\omega) + I_B(\omega) - 2 I_{AB}(\omega) = I_A(\omega) + I_B(\omega)$, modulo 2.


We can now settle a question raised in §1.3, and establish without pain
the identity:

(1.4.10) $(A \triangle B) \triangle C = A \triangle (B \triangle C)$.

Proof. Using (1.4.9) twice we have

(1.4.11) $I_{(A \triangle B) \triangle C} = I_{A \triangle B} + I_C = (I_A + I_B) + I_C$, modulo 2.

Now if you have understood the meaning of addition modulo 2 you should
see at once that it is an associative operation (what does that mean, “modulo
2”?). Hence the last member of (1.4.11) is equal to

$I_A + (I_B + I_C) = I_A + I_{B \triangle C} = I_{A \triangle (B \triangle C)}$, modulo 2.

We have therefore shown that the two sets in (1.4.10) have identical indicators,
hence they are identical. Q.E.D.

We do not need this result below. We just want to show that a trick is 
sometimes neater than a picture! 
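The trick can also be replayed by brute force. The sketch below, in Python, first checks that addition modulo 2 is associative on all eight triples of indicator values, then confirms (1.4.10) for three concrete sets. The sets and the helper `add_mod2` are illustrative assumptions; `^` is Python's built-in symmetric difference of sets.

```python
from itertools import product

def add_mod2(a, b):
    """Addition modulo 2 of two values in {0, 1}."""
    return (a + b) % 2

# Each triple (a, b, c) stands for the values of I_A, I_B, I_C at one point.
associative = all(
    add_mod2(add_mod2(a, b), c) == add_mod2(a, add_mod2(b, c))
    for a, b, c in product((0, 1), repeat=3)
)

# The same identity for concrete, arbitrarily chosen sets.
A, B, C = {1, 2, 3, 4}, {3, 4, 5}, {1, 4, 5, 6}
left = (A ^ B) ^ C
right = A ^ (B ^ C)
print(associative, left == right, sorted(left))
```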


Exercises


1. Why is the sequence of numbers {1, 2, 1, 2, 3} not a set?

2. If two sets have the same size, are they then identical? 

3. Can a set and a proper subset have the same size? (A proper subset is a 
subset which is not also a superset!) 

4. If two sets have identical complements, then they are themselves iden- 
tical. Show this in two ways: (i) by verbal definition, (ii) by using 
formula (1.2.1). 

5. If A, B, C have the same meanings as in Section 1.2, what do the
following sets mean:

$A \cup (B \cap C)$; $(A \setminus B) \setminus C$; $A \setminus (B \setminus C)$.

6. Show that

$(A \cup B) \cap C \neq A \cup (B \cap C)$

but also give some special cases where there is equality.
7. Using the atoms given in the decomposition (1.3.5), express

$A \cup B \cup C$; $(A \cup B)(B \cup C)$; $A \setminus B$; $A \triangle B$;

the set of ω which belongs to exactly 1 [exactly 2; at least 2] of the sets
A, B, C.

8. Show that $A \subset B$ if and only if $AB = A$; or $A \cup B = B$. (So the
relation of inclusion can be defined through identity and the operations.)

9. Show that A and B are disjoint if and only if $A \setminus B = A$; or
$A \cup B = A \triangle B$. (After No. 8 is done, this can be shown purely
symbolically without going back to the verbal definitions of the sets.)
10. Show that there is a distributive law also for difference:

$(A \setminus B) \cap C = (A \cap C) \setminus (B \cap C)$.

Is the dual

$(A \cap B) \setminus C = (A \setminus C) \cap (B \setminus C)$

also true?

11. Derive $(D_2)$ from $(D_1)$ by using $(C_1)$ and $(C_2)$.

12. Show that

$(A \cup B) \setminus (C \cup D) \subset (A \setminus C) \cup (B \setminus D)$.

13.* Let us define a new operation “/” as follows:

$A/B = A^c \cup B$.

Show that
(i) $(A/B) \cap (B/C) \subset A/C$;
(ii) $(A/B) \cap (A/C) = A/BC$;
(iii) $(A/B) \cap (B/A) = (A \triangle B)^c$.

In intuitive logic, “A/B” may be read as “A implies B.” Use this to
interpret the relations above.

14. If you like a “dirty trick” this one is for you. There is an operation
between two sets A and B from which alone all the operations defined
above can be derived. [Hint: It is sufficient to derive complement and
union from it. Look for some combination which contains these two.
It is not unique.]

15. Show that $A \subset B$ if and only if $I_A \le I_B$; and $A \cap B = \emptyset$ if and only
if $I_A I_B = 0$.

16. Think up some concrete schemes which illustrate formula (1.4.8).

17. Give a direct proof of (1.4.8) by checking it for all ω. You may use the
atoms in (1.3.4) if you want to be well organized.

18. Show that for any real numbers a and b, we have

$a + b = (a \vee b) + (a \wedge b)$.

Use this to prove (1.4.8) again.

19. Express $I_{A \setminus B}$ and $I_{A - B}$ in terms of $I_A$ and $I_B$.

20. Express $I_{A \cup B \cup C}$ as a polynomial of $I_A$, $I_B$, $I_C$. [Hint: Consider
$1 - I_{A \cup B \cup C}$.]

21. Show that

$I_{ABC} = I_A + I_B + I_C - I_{A \cup B} - I_{A \cup C} - I_{B \cup C} + I_{A \cup B \cup C}$.

You can verify this directly, but it is nicer to derive it from No. 20
by duality.


Chapter 2 
Probability 


2.1. Examples of probability 


We learned something about sets in Chapter 1; now we are going to 
measure them. The most primitive way of measuring is to count the number, 
so we will begin with such an example. 


Example 1. In Example (a’) of §1.1, suppose that the number of rotten apples
is 28. This gives a measure to the set A described in (a’), called its size
and denoted by |A|. But it does not tell anything about the total number of
apples in the bushel, namely the size of the sample space Ω given in Example
(a). If we buy a bushel of apples we are more likely to be concerned with the
relative proportion of rotten ones in it rather than their absolute number.
Suppose then the total number is 550. If we now use the letter P provisionally
for “proportion,” we can write this as follows:

(2.1.1) $P(A) = \frac{|A|}{|\Omega|} = \frac{28}{550}$.

Suppose next that we consider the set B of unripe apples in the same bushel,
whose number is 47. Then we have similarly

$P(B) = \frac{|B|}{|\Omega|} = \frac{47}{550}$.


It seems reasonable to suppose that an apple cannot be both rotten and unripe 
(this is really a matter of definition of the two adjectives); then the two sets 
are disjoint so their members do not overlap. Hence the number of “rotten
or unripe apples” is equal to the sum of the number of “rotten apples”
and the number of “unripe apples”: 28 + 47 = 75. This may be written in
symbols as:

(2.1.2) $|A + B| = |A| + |B|$.

If we now divide through by |Ω|, we obtain

(2.1.3) P(A + B) = P(A) + P(B). 

On the other hand, if some apples can be rotten and unripe at the same time,
such as when worms got into green ones, then the equation (2.1.2) must be
replaced by an inequality:

$|A \cup B| \le |A| + |B|$

which leads to

(2.1.4) $P(A \cup B) \le P(A) + P(B)$.

Now what is the excess of |A| + |B| over $|A \cup B|$? It is precisely the number
of “rotten and unripe apples,” that is, $|A \cap B|$. Thus

$|A \cup B| + |A \cap B| = |A| + |B|$

which yields the pretty equation

(2.1.5) $P(A \cup B) + P(A \cap B) = P(A) + P(B)$.
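The bookkeeping of Example 1 is easy to replay with Python sets. In the sketch below the apples are simply numbered 0 to 549; the overlap of 5 apples that are both rotten and unripe is an assumption made only to exercise (2.1.4) and (2.1.5), since the text gives no such figure.

```python
from fractions import Fraction

omega = set(range(550))      # the bushel of 550 apples
A = set(range(0, 28))        # 28 rotten apples
B = set(range(23, 70))       # 47 unripe apples; 5 of them also rotten (assumed)

def P(S):
    """Proportion of the bushel occupied by the subset S."""
    return Fraction(len(S), len(omega))

lhs = P(A | B) + P(A & B)    # left side of (2.1.5)
rhs = P(A) + P(B)            # right side of (2.1.5)
print(P(A), P(B), P(A & B), lhs == rhs)
```

Exact fractions are used instead of floating-point numbers so that the equality in (2.1.5) can be tested without rounding error.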


Example 2. A more sophisticated way of measuring a set is the area of a 
plane set as in Examples (f) and (f’) of Section 1.1, or the volume of a solid. 
It is said that the measurement of land areas was the origin of geometry and 
trigonometry in ancient times. While the nomads were still counting on their 
fingers and toes as in Example 1, the Chinese and Egyptians, among other 
peoples, were subdividing their arable lands, measuring them in units and 
keeping accounts of them on stone tablets or papyrus. This unit varied a 
great deal from one civilization to another (who knows the conversion rate 
of an acre into mou’s or hectares?). But again it is often the ratio of two 
areas which concerns us as in the case of a wild shot which hits the target 
board. The proportion of the area of a subset A to that of Ω may be written,
if we denote the area by the symbol | |:

(2.1.6) $P(A) = \frac{|A|}{|\Omega|}$.

This means also that if we fix the unit so that the total area of Ω is 1 unit,
then the area of A is equal to the fraction P(A) on this scale. Formula (2.1.6) 
looks just like formula (2.1.1) by the deliberate choice of notation in order 
to underline the similarity of the two situations. Furthermore, for two sets 
A and B the previous relations (2.1.3) to (2.1.5) hold equally well in their new 
interpretations. 


Example 3. When a die is thrown there are six possible outcomes. If we compare
the process of throwing a particular number [face] with that of picking
a particular apple in Example 1, we are led to take Ω = {1, 2, 3, 4, 5, 6} and
define

(2.1.7) $P(\{k\}) = \frac{1}{6}$, k = 1, 2, 3, 4, 5, 6.


Here we are treating the six outcomes as “equally likely,” so that the same 
measure is assigned to all of them, just as we have done tacitly with the apples.
This hypothesis is usually implied by saying that the die is “perfect.” In
reality of course no such die exists. For instance the mere marking of the 
faces would destroy the perfect symmetry; and even if the die were a perfect 
cube, the outcome would still depend on the way it is thrown. Thus we 
must stipulate that this is done in a perfectly symmetrical way too, and so on. 
Such conditions can be approximately realized and constitute the basis of an 
assumption of equal likelihood on grounds of symmetry. 

Now common sense demands an empirical interpretation of the “probability”
given in (2.1.7). It should give a measure of what is likely to happen,
and this is associated in the intuitive mind with the observable frequency of
occurrence. Namely, if the die is thrown a number of times, how often will a
particular face appear? More generally, let A be an event determined by the
outcome; e.g. “to throw a number not less than 5 [or an odd number].”
Let $N_n(A)$ denote the number of times the event A is observed in n throws;
then the relative frequency of A in these trials is given by the ratio

(2.1.8) $Q_n(A) = \frac{N_n(A)}{n}$.

There is good reason to take this $Q_n$ as a measure of A. Suppose B is another
event such that A and B are incompatible or mutually exclusive in the sense
that they cannot occur in the same trial. Clearly we have
$N_n(A + B) = N_n(A) + N_n(B)$ and consequently

(2.1.9) $Q_n(A + B) = \frac{N_n(A + B)}{n} = \frac{N_n(A) + N_n(B)}{n} = \frac{N_n(A)}{n} + \frac{N_n(B)}{n} = Q_n(A) + Q_n(B)$.


Similarly for any two events A and B in connection with the same game, not
necessarily incompatible, the relations (2.1.4) and (2.1.5) hold with the P’s
there replaced by our present $Q_n$. Of course this $Q_n$ depends on n, and will
fluctuate, even wildly, as n increases. But if you let n go to infinity, will the
sequence of ratios $Q_n(A)$ “settle down to a steady value”? Such a question
can never be answered empirically, since by the very nature of a limit we
cannot put an end to the trials. So it is a mathematical idealization to assume
that such a limit does exist, and then write

(2.1.10) $Q(A) = \lim_{n \to \infty} Q_n(A)$.

We may call this the empirical limiting frequency of the event A. If you know
how to operate with limits then you can see easily that the relation (2.1.9)
remains true “in the limit.” Namely when we let $n \to \infty$ everywhere in that
formula and use the definition (2.1.10), we obtain (2.1.3) with P replaced by Q.
Similarly (2.1.4) and (2.1.5) also hold in this context.
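A computer can stand in for the die if we accept a pseudo-random generator as the "mechanism." The following Python sketch records 6000 simulated throws and checks the additivity (2.1.9) of the relative frequency for two incompatible events. The seed, the number of throws and the helper name `relative_frequency` are arbitrary choices for illustration.

```python
import random
from fractions import Fraction

def relative_frequency(throws, event):
    """Q_n(event): fraction of the recorded throws that fall in the event."""
    hits = sum(1 for k in throws if k in event)
    return Fraction(hits, len(throws))

rng = random.Random(0)                      # fixed seed, an arbitrary choice
throws = [rng.randint(1, 6) for _ in range(6000)]

A = {5, 6}          # "a number not less than 5"
B = {1}             # incompatible with A: no throw shows both
q_sum = relative_frequency(throws, A) + relative_frequency(throws, B)
q_union = relative_frequency(throws, A | B)
print(float(relative_frequency(throws, A)), q_union == q_sum)
```

The additivity holds exactly for every recorded sequence, whereas the value of the frequency itself changes with the seed; that distinction is precisely the one drawn in the text.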

But the limit Q still depends on the actual sequence of trials which are
carried out to determine its value. On the face of it, there is no guarantee
whatever that another sequence of trials, even if it is carried out under the 
same circumstances, will yield the same value. Yet our intuition demands that 
a measure of the likelihood of an event such as A should tell something more 
than the mere record of one experiment. A viable theory built on the fre- 
quencies will have to assume that the Q defined above is in fact the same for 
all similar sequences of trials. Even with the hedge implicit in the word 
“similar,” that is assuming a lot to begin with. Such an attempt has been 
made with limited success, and has a great appeal to common sense, but we 
will not pursue it here. Rather, we will use the definition in (2.1.7) which 
implies that if A is any subset of Ω and |A| its size, then

(2.1.11) $P(A) = \frac{|A|}{|\Omega|} = \frac{|A|}{6}$.

For example, if A is the event “to throw an odd number,” then A is identified
with the set {1, 3, 5} and P(A) = 3/6 = 1/2.

It is a fundamental proposition in the theory of probability that under
certain conditions (repeated independent trials with identical die), the limiting
frequency in (2.1.10) will indeed exist and be equal to P(A) defined in (2.1.11),
for “practically all” conceivable sequences of trials. This celebrated theorem,
called the Law of Large Numbers, is considered to be the cornerstone of all 
empirical sciences. In a sense it justifies the intuitive foundation of probability 
as frequency discussed above. The precise statement and derivation will be 
given in Chapter 7. We have made this early announcement to quiet your 
feelings or misgivings about frequencies and to concentrate for the moment 
on sets and probabilities in the following sections. 


2.2. Definition and illustrations 


First of all, a probability is a number associated with or assigned to a set in 
order to measure it in some sense. Since we want to consider many sets at 
the same time (that is why we studied Chapter 1), and each of them will have 
a probability associated with it, this makes probability a “function of sets.” 
You should have already learned in some mathematics course what a function 
means, in fact this notion has been used a little in Chapter 1. Nevertheless, 
let us review it in the familiar notation: a function f defined for some or all 
real numbers is a rule of association, by which we assign the number f(x) to 
the number x. It is sometimes written as f(·), or more painstakingly as
follows:

(2.2.1) $f: x \to f(x)$.

So when we say a probability is a function of sets we mean a similar
association, except that x is replaced by a set S:

(2.2.2) $P: S \to P(S)$.


Figure 11


The value P(S) is still a number, indeed it will be a number between 0 and 1. 
We have not been really precise in (2.2.1), because we have not specified the 
set of x there for which it has a meaning. This set may be the interval (a, b) 
or the half line (0, ∞) or some more complicated set called the domain of f.
Now what is the domain of our probability function P? It must be a set of
sets, or to avoid the double usage, a family (class) of sets. As in Chapter 1
we are talking about subsets of a fixed sample space Ω. It would be nice if
we could use the family of all subsets of Ω, but unexpected difficulties will
arise in this case if no restriction is imposed on Ω. We might say that if Ω is
too large, namely when it contains uncountably many points, then it has too
many subsets, and it becomes impossible to assign a probability to each of
them and still satisfy a basic rule (Axiom (ii*) below) governing the
assignments. However, if Ω is a finite or countably infinite set then no such trouble
can arise and we may indeed assign a probability to each and all of its subsets.
This will be shown at the beginning of §2.4. You are supposed to know what
a finite set is (although it is by no means easy to give a logical definition,
while it is mere tautology to say that “it has only a finite number of points”);
let us review what a countably infinite set is. This notion will be of sufficient 
importance to us, even if it only lurks in the background most of the time. 


Figure 12


A set is countably infinite when it can be put into 1-to-1 correspondence
with the set of positive integers. This correspondence can then be exhibited
by labeling the elements as $\{s_1, s_2, \ldots, s_n, \ldots\}$. There are of course many
ways of doing this, for instance we can just let some of the elements swap
labels (or places if they are thought of as being laid out in a row). The set of
positive rational numbers is countably infinite, hence they can be labeled in
some way as $\{r_1, r_2, \ldots, r_n, \ldots\}$, but don’t think for a moment that you can
do this by putting them in increasing order as you can with the positive
integers $1 < 2 < \cdots < n < \cdots$. From now on we shall call a set countable
when it is either finite or countably infinite. Otherwise it is called uncountable.
For example, the set of all real numbers is uncountable. We shall deal with
uncountable sets later, and we will review some properties of a countable set
when we need them. For the present we will assume the sample space Ω to be
countable in order to give the following definition in its simplest form, without
a diverting complication. As a matter of fact, we could even assume Ω to be
finite as in Examples (a) to (e) of §1.1, without losing the essence of the
discussion below.


Definition. A probability measure on the sample space Ω is a function of
subsets of Ω satisfying three axioms:

(i) For every set $A \subset \Omega$, the value of the function is a non-negative
number: $P(A) \ge 0$.

(ii) For any two disjoint sets A and B, the value of the function for their
union A + B is equal to the sum of its value for A and its value
for B:

$P(A + B) = P(A) + P(B)$ provided $AB = \emptyset$.

(iii) The value of the function for Ω (as a subset) is equal to 1:

$P(\Omega) = 1$.


Observe that we have been extremely careful in distinguishing the function
P(·) from its values such as P(A), P(B), P(A + B), P(Ω). Each of these is
“a probability,” but the function itself should properly be referred to as a
“probability measure” as indicated.

Example 1 in §2.1 shows that the proportion P defined there is in fact a
probability measure on the sample space, which is a bushel of 550 apples. It
assigns a probability to every subset of these apples and this assignment
satisfies the three axioms above. In Example 2 if we take Ω to be all the land
that belonged to the Pharaoh, it is unfortunately not a countable set.
Nevertheless we can define the area for a very large class of subsets which are called
“measurable,” and if we restrict ourselves to these subsets only, the “area
function” is a probability measure as shown in Example 2 where this
restriction is ignored. Note that Axiom (iii) reduces to a convention: the decree of
a unit. Now how can a land area not be measurable? While this is a
sophisticated mathematical question which we will not go into in this book, it is easy
to think of practical reasons for the possibility: the piece of land may be too
jagged, rough or inaccessible. (See Fig. 13 on page 25.)
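For a finite sample space the three axioms can be verified exhaustively, since there are only finitely many subsets. The Python sketch below does this for the proportion measure on a five-point space, a deliberately small stand-in for the bushel (550 points would give far too many subsets to list). The helper `all_subsets` is introduced here for illustration only.

```python
from fractions import Fraction
from itertools import combinations

omega = frozenset(range(5))        # a small stand-in for the bushel

def P(S):
    """Proportion measure: size of S divided by size of omega."""
    return Fraction(len(S), len(omega))

def all_subsets(S):
    """Yield every subset of the finite set S as a frozenset."""
    items = list(S)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

subsets = list(all_subsets(omega))
axiom_i = all(P(A) >= 0 for A in subsets)
# Axiom (ii) is only required for disjoint pairs, i.e. when A & B is empty.
axiom_ii = all(P(A | B) == P(A) + P(B)
               for A in subsets for B in subsets if not (A & B))
axiom_iii = P(omega) == 1
print(len(subsets), axiom_i, axiom_ii, axiom_iii)
```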

In Example 3 we have shown that the empirical relative frequency is a 
probability measure. But we will not use this definition in this book. Instead, 
we will use the first definition given at the beginning of Example 3, which is 
historically the earliest of its kind. The general formulation will now be given. 


Example 4. A classical enunciation of probability runs as follows. The prob- 
ability of an event is the ratio of the number of cases favorable to that event 
to the total number of cases, provided that all these are equally likely. 

To translate this into our language: the sample space is a finite set of
possible cases: $\{\omega_1, \omega_2, \ldots, \omega_m\}$, each $\omega_i$ being a “case.” An event A is a
subset $\{\omega_{i_1}, \omega_{i_2}, \ldots, \omega_{i_n}\}$, each $\omega_{i_j}$ being a “favorable case.” The
probability of A is then the ratio

(2.2.3) $P(A) = \frac{|A|}{|\Omega|} = \frac{n}{m}$.

As we see from the discussion in Example 1, this defines a probability measure
P on Ω anyway, so that the stipulation above that the cases be equally likely
is superfluous from the axiomatic point of view. Besides, what does it really
mean? It sounds like a bit of tautology, and how is one going to decide
whether the cases are equally likely or not?

Figure 13

A celebrated example will illustrate this. Let two coins be tossed. 
D’Alembert (mathematician, philosopher and encyclopedist, 1717-83) argued 
that there are three possible cases, namely: 


(i) both heads, (ii) both tails, (iii) a head and a tail. 


So he went on to conclude that the probability of “a head and a tail” is
equal to 1/3. If he had figured that this probability should have something
to do with the experimental frequency of the occurrence of the event, he might 
have changed his mind after tossing two coins more than a few times. (History 
does not record if he ever did that, but it is said that for centuries people 
believed that men had more teeth than women because Aristotle had said so, 
and apparently nobody bothered to look into a few mouths.) For the three 
cases he considered are not equally likely. Case (iii) should be split into two: 


(iiia) first coin shows head and second coin shows tail. 


(iiib) first coin shows tail and second coin shows head. 


It is the four cases (i), (ii), (iiia) and (iiib) that are equally likely by sym- 
metry and on empirical evidence. This should be obvious if we toss the 
two coins one after the other rather than simultaneously. However, there is
an important point to be made clear here. The two coins may be physically
indistinguishable so that in so far as actual observation is concerned,
D’Alembert’s 3 cases are the only distinct patterns to be recognized. In the
model of two coins they happen not to be equally likely on the basis of
common sense and experimental evidence. But in an analogous model for
certain microscopic particles, called Bose-Einstein statistics (see Exercise 24
of Chapter 3), they are indeed assumed to be equally likely in order to explain
some types of physical phenomena. Thus what we regard as “equally likely”
is a matter outside of the axiomatic formulation. To put it another way, if
we use (2.2.3) as our definition of probability then we are in effect treating
the ω’s as equally likely, in the sense that we count only their numbers and
do not attach different weights to them.
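The four-case count for two coins can be written out in a few lines of Python. The sketch assumes, as the text does for ordinary coins, that the ordered outcomes are the equally likely cases; the function `P` below is merely formula (2.2.3) applied to that sample space.

```python
from fractions import Fraction
from itertools import product

# Ordered outcomes for two distinguishable coins: HH, HT, TH, TT.
omega = list(product("HT", repeat=2))

def P(event):
    """Classical probability (2.2.3): favorable cases over all cases."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

p_mixed = P(lambda w: set(w) == {"H", "T"})   # one head and one tail
p_both_heads = P(lambda w: w == ("H", "H"))
print(p_mixed, p_both_heads)
```

Under this model "a head and a tail" has probability 1/2 rather than D'Alembert's 1/3, because it is made up of two of the four ordered cases.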


Example 5. If six dice are rolled, what is the probability that all show differ- 
ent faces? 

This is just Example (e) and (e’). It is stated elliptically on purpose to get 
you used to such problems. We have already mentioned that the total number 
of possible outcomes is equal to $6^6 = 46656$. They are supposed to be all
“equally likely” although we never breathed a word about this assumption.
Why, nobody can solve the problem as announced without such an assump- 
tion. Other data about the dice would have to be given before we could begin— 
which is precisely the difficulty when similar problems arise in practice. Now 
if the dice are all perfect, and the mechanism by which they are rolled is also 
perfect, which excludes any collusion between the movements of the several 
dice, then our hypothesis of equal likelihood may be justified. Such conditions 
are taken for granted in a problem like this when nothing is said about the 
dice. The solution is then given by (2.2.3) with $n = 6!$ and $m = 6^6$ (see
Example 2 in §3.1 for these computations):

$\frac{6!}{6^6} = \frac{720}{46656} = .015432$

approximately.

Let us note that if the dice are not distinguishable from each other, then 
to the observer there is exactly one pattern in which the six dice show different 
faces. Similarly, the total number of different patterns when six dice are rolled 
is much smaller than $6^6$ (see Example 3 of §3.2). Yet when we count the
possible outcomes we must think of the dice as distinguishable, as if they 
were painted in different colors. This is one of the vital points to grasp in 
the counting of cases; see Chapter 3. 
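Since the sample space of Example 5 has only 46656 points, a computer can simply list them. The Python sketch below treats the six dice as distinguishable, so that an outcome is an ordered 6-tuple, and counts the tuples in which all faces differ. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import factorial

faces = range(1, 7)
total = 0
favorable = 0
# Treat the six dice as distinguishable: each outcome is an ordered 6-tuple.
for outcome in product(faces, repeat=6):
    total += 1
    if len(set(outcome)) == 6:      # all six faces are different
        favorable += 1

p = Fraction(favorable, total)
# Compare the brute-force count with the closed forms 6! and 6**6.
matches_formula = (favorable == factorial(6) and total == 6 ** 6)
print(favorable, total, float(p), matches_formula)
```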

In some situations the equally likely cases must be searched out. This 
point will be illustrated by a famous historical problem called the “problem 
of points.” 


Example 6. Two players A and B play a series of games in which the probability
of each winning a single game is equal to 1/2, irrespective [independent]
of the outcomes of other games. For instance, they may play tennis in which
they are equally matched, or simply play “heads or tails” by tossing an
unbiased coin. Each player gains a “point” when he wins a game, and nothing
when he loses. Suppose that they stop playing when A needs 2 more points,
and B needs 3 more points to win the stake. How should they divide it fairly?

It is clear that the winner will be decided in 4 more games. For in those
4 games either A will have won ≥ 2 points or B will have won ≥ 3 points,
but not both. Let us enumerate all the possible outcomes of these 4 games
using the letter A or B to denote the winner of each game:


AAAA    AAAB    AABB    ABBB    BBBB
        AABA    ABAB    BABB
        ABAA    ABBA    BBAB
        BAAA    BAAB    BBBA
                BABA
                BBAA


These are equally likely cases on grounds of symmetry. There are*
$\binom{4}{4} + \binom{4}{3} + \binom{4}{2} = 11$ cases in which A wins the stake; and
$\binom{4}{1} + \binom{4}{0} = 5$ cases in which B wins the stake. Hence the stake
should be divided in the ratio 11:5. Suppose it is $64000; then A gets $44000,
B gets $20000. [We
are taking the liberty of using the dollar as currency; the U.S.A. did not exist
at the time when the problem was posed.]
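Pascal's count can be reproduced by listing the 16 sequences. The Python sketch below assumes, as in the text, that all four remaining games are played and that each sequence is equally likely; the stake of 64000 is the figure for which the two shares come to 44000 and 20000.

```python
from fractions import Fraction
from itertools import product

# All 2**4 = 16 equally likely ways the next four games could go.
outcomes = ["".join(w) for w in product("AB", repeat=4)]

# A needs 2 more points; otherwise B necessarily has at least 3.
a_wins = [w for w in outcomes if w.count("A") >= 2]
b_wins = [w for w in outcomes if w.count("B") >= 3]

stake = 64000
share_a = stake * Fraction(len(a_wins), len(outcomes))
share_b = stake * Fraction(len(b_wins), len(outcomes))
print(len(a_wins), len(b_wins), share_a, share_b)
```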

This is Pascal’s solution in a letter to Fermat dated August 24, 1654. 
[Blaise Pascal (1623-62); Pierre de Fermat (1601-65); both among the greatest 
mathematicians of all time.] Objection was raised by a learned contemporary 
(and repeated through the ages) that the enumeration above was not reason- 
able, because the series would have stopped as soon as the winner was decided 
and not have gone on through all 4 games in some cases. Thus the real 
possibilities are as follows: 


AA      ABBB
ABA     BABB
ABBA    BBAB
BAA     BBB
BABA
BBAA


But these are not equally likely cases. In modern terminology, if these 10 cases 
are regarded as constituting the sample space, then 


* See (3.2.3) for notation used below. 




$P(AA) = \frac{1}{4}$, $P(ABA) = P(BAA) = P(BBB) = \frac{1}{8}$,
$P(ABBA) = P(BABA) = P(BBAA) = P(ABBB) = P(BABB) = P(BBAB) = \frac{1}{16}$,

since A and B are independent events with probability 1/2 each (see §2.4).
If we add up these probabilities we get of course

$P(\text{A wins the stake}) = \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \frac{1}{8} + \frac{1}{16} + \frac{1}{16} = \frac{11}{16}$,
$P(\text{B wins the stake}) = \frac{1}{16} + \frac{1}{16} + \frac{1}{16} + \frac{1}{8} = \frac{5}{16}$.


Pascal did not quite explain his method this way, saying merely that “it
is absolutely equal and indifferent to each whether they play in the natural
way of the game, which is to finish as soon as one has his score, or whether
they play the entire four games.” A later letter by him seems to indicate that
he fumbled on the same point in a similar problem with three players. The 
student should take heart that this kind of reasoning was not easy even for 
past masters. 
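The agreement between the two ways of counting can be made concrete. In the Python sketch below each of the 16 full sequences is cut off at the game where the stake is decided, and the probability 1/16 of each full sequence is credited to its truncated form. The function `truncate` and its parameters are hypothetical names introduced for this illustration.

```python
from fractions import Fraction
from itertools import product

def truncate(games, need_a=2, need_b=3):
    """Cut a 4-game sequence at the game where the stake is decided."""
    a = b = 0
    for i, g in enumerate(games, start=1):
        if g == "A":
            a += 1
        else:
            b += 1
        if a == need_a or b == need_b:
            return games[:i]
    raise ValueError("four games always decide the stake")

# Each full 4-game sequence has probability 1/16; group them by truncation.
prob = {}
for w in product("AB", repeat=4):
    key = truncate("".join(w))
    prob[key] = prob.get(key, Fraction(0)) + Fraction(1, 16)

# A truncated sequence won by A contains exactly 2 A's; one won by B, 3 B's.
p_a = sum(p for k, p in prob.items() if k.count("A") == 2)
p_b = sum(p for k, p in prob.items() if k.count("B") == 3)
print(len(prob), prob["AA"], p_a, p_b)
```

The ten truncated sequences receive unequal weights, yet the totals for A and B are the same 11/16 and 5/16 as in the full enumeration.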


2.3. Deductions from the axioms 


In this section we will do some simple “axiomatics.” That is to say, we shall
deduce some properties of the probability measure from its definition, using
of course the axioms but nothing else. In this respect the axioms of a
mathematical theory are like the constitution of a government. Unless and until it
is changed or amended, every law must be made to follow from it. In
mathematics we have the added assurance that there are no divergent views as to
how the constitution should be construed.

We record some consequences of the axioms in (iv) to (viii) below. First 
of all, let us show that a probability is indeed a number between 0 and 1. 


(iv) For any set A, we have

$P(A) \le 1$.

This is easy but you will see that in the course of deducing it we shall use
all three axioms. Consider the complement $A^c$ as well as A. These two sets
are disjoint and their union is Ω:

(2.3.1) $A + A^c = \Omega$.

So far, this is just set theory, no probability theory yet. Now use Axiom (ii)
on the left side of (2.3.1) and Axiom (iii) on the right:

(2.3.2) $P(A) + P(A^c) = P(\Omega) = 1$.

Finally use Axiom (i) for $A^c$ to get

$0 \le P(A) = 1 - P(A^c) \le 1$.

Of course the first inequality above is just Axiom (i). You might object
to our slow pace above by pointing out that since A is contained in Ω, it is
obvious that $P(A) \le P(\Omega) = 1$. This reasoning is certainly correct but we still
have to pluck it from the axioms, and that is the point of the little proof
above. We can also get it from the following more general proposition.


(v) For any two sets such that $A \subset B$, we have

$P(A) \le P(B)$, and $P(B - A) = P(B) - P(A)$.

The proof is an imitation of the preceding one with B playing the role
of Ω. We have

$B = A + (B - A)$,
$P(B) = P(A) + P(B - A) \ge P(A)$.

The next proposition is such an immediate extension of Axiom (ii) that
we could have adopted it instead as an axiom.


(vi) For any finite number of disjoint sets $A_1, \ldots, A_n$, we have

(2.3.3) $P(A_1 + \cdots + A_n) = P(A_1) + \cdots + P(A_n)$.

This property of the probability measure is called finite additivity. It is
trivial if we recall what “disjoint” means and use (ii) a few times; or we may
proceed by induction if we are meticulous. There is an important extension
of (2.3.3) to a countable number of sets later, not obtainable by induction!

As already checked in several special cases, there is a generalization of 
Axiom (ii), hence also of (2.3.3), to sets which are not necessarily disjoint. 
You may find it trite, but it has the dignified name of Boole’s inequality. Boole 
(1815-1864) was a pioneer in the “laws of thought” and author of Theories 
of Logic and Probabilities. 


(vii) For any finite number of arbitrary sets $A_1, \ldots, A_n$, we have

(2.3.4) $P(A_1 \cup \cdots \cup A_n) \le P(A_1) + \cdots + P(A_n)$.

Let us first show this when n = 2. For any two sets A and B, we can write
their union as the sum of disjoint sets as follows:

(2.3.5) $A \cup B = A + A^c B$.

Now we can apply Axiom (ii) to get

(2.3.6) $P(A \cup B) = P(A) + P(A^c B)$.

Since $A^c B \subset B$ we can apply (v) to get (2.3.4).

The general case follows easily by mathematical induction, and you should 
write it out as a good exercise on this method. You will find that you need 
the associative law for union of sets as well as that for the addition of 
numbers. 

The next question is the difference between the two sides of the inequality 
(2.3.4). The question is somewhat moot since it depends on what we want to 
use to express the difference. However, when n = 2 there is a clear answer.

(viii) For any two sets A and B we have

(2.3.7) $P(A \cup B) + P(A \cap B) = P(A) + P(B)$.

This can be gotten from (2.3.6) by observing that $A^c B = B - AB$, so that
we have by virtue of (v):

$P(A \cup B) = P(A) + P(B - AB) = P(A) + P(B) - P(AB)$,

which is equivalent to (2.3.7). Another neat proof is given in Exercise 12.
We shall postpone a discussion of the general case until Section 6.2. In 
practice, the inequality is often more useful than the corresponding identity 
which is rather complicated. 
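A small numerical case may help fix the difference between the identity (2.3.7) and Boole's inequality (2.3.4). The Python sketch below uses a four-point sample space with unequal weights; the points, weights and sets are hypothetical and chosen only so that the three sets overlap.

```python
from fractions import Fraction

# A hypothetical weighted sample space: weights are non-negative and sum to 1.
weights = {"a": Fraction(1, 2), "b": Fraction(1, 4),
           "c": Fraction(1, 8), "d": Fraction(1, 8)}

def P(S):
    """Probability of S as the sum of the weights of its points."""
    return sum((weights[w] for w in S), Fraction(0))

A, B, C = {"a", "b"}, {"b", "c"}, {"c", "d"}

identity_237 = P(A | B) + P(A & B) == P(A) + P(B)      # (2.3.7)
boole_234 = P(A | B | C) <= P(A) + P(B) + P(C)         # (2.3.4) with n = 3
gap = P(A) + P(B) + P(C) - P(A | B | C)                # excess from overlaps
print(identity_237, boole_234, gap)
```

Here the sum of the three probabilities exceeds the probability of the union by 3/8, which is exactly the weight of the points counted twice.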
We will not quit formula (2.3.7) without remarking on its striking resemblance
to formula (1.4.8) of §1.4, which is repeated below for the sake of
comparison:

(2.3.8) $I_{A \cup B} + I_{A \cap B} = I_A + I_B$.

There is indeed a deep connection between the pair, as follows. The probability
P(S) of each set S can be obtained from its indicator function $I_S$ by a
procedure (operation) called “taking expectation” or “integration.” If we
perform this on (2.3.8) term-by-term the result is (2.3.7). This procedure is
an essential part of probability theory and will be thoroughly discussed in 
Chapter 6. See Exercise 19 for a special case. 

To conclude our axiomatics, we will now strengthen Axiom (ii) or its 
immediate consequence (vi), namely the finite additivity of P, into a new 
axiom. 


(ii*) Axiom of countable additivity. For a countably infinite collection of
disjoint sets $A_k$, k = 1, 2, ..., we have

(2.3.9) $P\left(\sum_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k)$.

This axiom includes (vi) as a particular case, for we need only put $A_k = \emptyset$
for k > n in (2.3.9) to obtain (2.3.3). The empty set is disjoint from any other
set including itself, and has probability zero (why?). If Ω is a finite set, then
the new axiom reduces to the old one. But it is important to see why (2.3.9)
cannot be deduced from (2.3.3) by letting $n \to \infty$. Let us try this by rewriting
(2.3.3) as follows:

(2.3.10) $P\left(\sum_{k=1}^{n} A_k\right) = \sum_{k=1}^{n} P(A_k)$.


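Since the left side above cannot exceed 1 for all n, the series on the right side
must converge and we obtain

(2.3.11) $\lim_{n \to \infty} P\left(\sum_{k=1}^{n} A_k\right) = \lim_{n \to \infty} \sum_{k=1}^{n} P(A_k) = \sum_{k=1}^{\infty} P(A_k)$.

Comparing this established result with the desired result (2.3.9), we see that
the question boils down to:

$\lim_{n \to \infty} P\left(\sum_{k=1}^{n} A_k\right) = P\left(\sum_{k=1}^{\infty} A_k\right)$,

which can be exhibited more suggestively as

(2.3.12) $\lim_{n \to \infty} P\left(\sum_{k=1}^{n} A_k\right) = P\left(\lim_{n \to \infty} \sum_{k=1}^{n} A_k\right)$.

See end of §1.3. (See Fig. 14 on page 32.)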

Thus it is a matter of interchanging the two operations “lim” and “P” in 
(2.3.12), or you may say, “taking the limit inside the probability relation.” If 
you have had enough calculus you know this kind of interchange is often hard 
to justify and may be illegitimate or even invalid. The new axiom is created 
to secure it in the present case and has fundamental consequences in the theory 
of probability. 


2.4. Independent events 


From now on, a “probability measure” will satisfy Axioms (i), (ii*) and (iii).
Any subset of $\Omega$ to which such a probability has been assigned will also be
called an event.

We shall show how easy it is to construct probability measures for any 
countable space $\Omega = \{\omega_1, \omega_2, \ldots, \omega_n, \ldots\}$. To each sample point $\omega_n$ let us
attach an arbitrary “weight” $p_n$ subject only to the conditions:


(2.4.1)    $\forall n\colon p_n \ge 0; \quad \sum_n p_n = 1.$


[Figure 14: $\sum_{n=1}^{\infty} A_n$]


This means that the weights are positive or zero, and add up to 1 altogether. 
Now for any subset $A$ of $\Omega$, we define its probability to be the sum of the
weights of all the points in it. In symbols, we put first 


(2.4.2)    $\forall n\colon P(\{\omega_n\}) = p_n;$


and then for every $A \subset \Omega$:


$P(A) = \sum_{\omega_n \in A} p_n = \sum_{\omega_n \in A} P(\{\omega_n\}).$

We may write the last term above more neatly as


(2.4.3)    $P(A) = \sum_{\omega \in A} P(\{\omega\}).$


Thus $P$ is a function defined for all subsets of $\Omega$ and it remains to check that
it satisfies axioms (i), (ii*) and (iii). This requires nothing but a bit of clear-headed
thinking and is best done by yourself. Since the weights are quite
arbitrary apart from the easy conditions in (2.4.1), you see that probability
measures come “a dime a dozen” in a countable sample space. In fact, we can
get them all by the above method of construction. For if any probability
measure $P$ is given, never mind how, we can define $p_n$ to be $P(\{\omega_n\})$ as in
(2.4.2), and then $P(A)$ must be given as in (2.4.3), because of Axiom (ii*).
Furthermore the $p_n$'s will satisfy (2.4.1) as a simple consequence of the axioms.
In other words, any given P is necessarily of the type described by our 
construction. 

In the very special case that $\Omega$ is finite and contains exactly $m$ points, we
may attach equal weights to all of them, so that


$p_n = \frac{1}{m}, \quad n = 1, 2, \ldots, m.$

Then we are back to the “equally likely” situation in Example 4 of §2.2. But
in general the $p_n$'s need not be equal, and when $\Omega$ is countably infinite they
cannot be all equal (why?). The preceding discussion shows the degree of
arbitrariness involved in the general concept of a probability measure. 
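
The construction by weights can be tried out on a small scale. The following Python sketch is only an illustration: the four-point sample space, the particular weights and the names `weights` and `prob` are assumptions made here, not part of the text. It builds P from the weights as in (2.4.3) and checks additivity on one pair of disjoint sets.

```python
from fractions import Fraction

# Hypothetical weights p_n on a four-point sample space: each is
# nonnegative and together they add up to 1, as (2.4.1) requires.
weights = {
    "w1": Fraction(1, 2),
    "w2": Fraction(1, 4),
    "w3": Fraction(1, 4),
    "w4": Fraction(0),
}
assert all(p >= 0 for p in weights.values())
assert sum(weights.values()) == 1


def prob(event):
    """P(A): the sum of the weights of the points in A, as in (2.4.3)."""
    return sum((weights[w] for w in event), Fraction(0))


A = {"w1", "w2"}
B = {"w3"}  # disjoint from A

print(prob(A), prob(B), prob(A | B))
print(prob(A | B) == prob(A) + prob(B))  # additivity for disjoint sets
```

Exact fractions are used instead of floating-point numbers so that the equalities can be tested without rounding error.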

An important model of probability space is that of repeated independent 
trials: This is the model used when a coin is tossed, a die thrown, a card 
drawn from a deck (with replacement) several times. Alternately, we may 
toss several coins or throw several dice at the same time. Let us begin with 
an example. 


Example 7. First toss a coin, then throw a die, finally draw a card from a deck 
of poker cards. Each trial produces an event; let 


A = coin falls heads;
B = die shows number 5 or 6;
C = card drawn is a spade.


Assume that the coin is fair, the die is perfect and the deck thoroughly
shuffled. Furthermore assume that these three trials are carried out “independently”
of each other, which means intuitively that the outcome of each
trial does not influence that of the others. For instance this condition is 
approximately fulfilled if they are done by different people in different places, 
or by the same person in different months! Then all possible joint outcomes 
may be regarded as equally likely. There are respectively 2, 6 and 52 possible 
cases for the individual trials, and the total number of cases for the whole 
set of trials is obtained by multiplying these numbers together: 2·6·52 (as
you will soon see it is better not to compute this product). This follows from 
a fundamental rule of counting which is fully discussed in §3.1, and which 
you should read now if need be. [In general, many parts of this book may 
be read in different orders, back and forth.] The same rule yields the numbers 
of favorable cases to the events A, B, C, AB, AC, BC, ABC given below, 
where the symbol |. . .| for size is used:

|A| = 1·6·52,    |B| = 2·2·52,    |C| = 2·6·13,
|AB| = 1·2·52,   |AC| = 1·6·13,   |BC| = 2·2·13,
|ABC| = 1·2·13.


Dividing these numbers by $|\Omega|$ = 2·6·52, we obtain after quick cancellation
of factors:


P(A) = 1/2,     P(B) = 1/3,     P(C) = 1/4,
P(AB) = 1/6,    P(AC) = 1/8,    P(BC) = 1/12,
P(ABC) = 1/24.


We see at a glance that the following set of equations holds:


(2.4.4) P(AB) = P(A)P(B), P(AC) = P(A)P(C), P(BC) = P(B)P(C) 
P(ABC) = P(A)P(B)P(C). 
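
If you would rather let a machine do the counting, the relations in (2.4.4) can be checked by listing all joint outcomes of Example 7. The sketch below assumes a standard 52-card deck with 13 cards per suit; the labels for suits and ranks and the helper names `prob` and `joint` are illustrative choices, not notation from the text.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
# The suit and rank labels are illustrative; only the counts 4 and 13 matter.
suits = ["spade", "heart", "diamond", "club"]
deck = [(rank, suit) for suit in suits for rank in range(1, 14)]

# All 2*6*52 joint outcomes (coin, die, card), regarded as equally likely.
omega = list(product(coin, die, deck))


def prob(event):
    """Proportion of joint outcomes for which the predicate `event` holds."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def A(w):
    return w[0] == "H"  # coin falls heads


def B(w):
    return w[1] in (5, 6)  # die shows 5 or 6


def C(w):
    return w[2][1] == "spade"  # card drawn is a spade


def joint(*events):
    """Predicate for the joint occurrence (intersection) of the events."""
    return lambda w: all(e(w) for e in events)


# The four relations in (2.4.4).
print(prob(joint(A, B)) == prob(A) * prob(B))
print(prob(joint(A, C)) == prob(A) * prob(C))
print(prob(joint(B, C)) == prob(B) * prob(C))
print(prob(joint(A, B, C)) == prob(A) * prob(B) * prob(C))
```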


The reader is now asked to convince himself that this set of relations will 
also hold for any three events A, B, C such that A is determined by the coin, 
B by the die and C by the card drawn alone. When this is the case we say that 
these trials are stochastically independent as well as the events so produced. 
The adverb “stochastically” is usually omitted for brevity. 

The astute reader may observe that we have not formally defined the word
“trial,” and yet we are talking about independent trials! A logical construction
of such objects is quite simple but perhaps a bit too abstract for casual
introduction. It is known as “product space”; see Exercise 29. However, it
takes less fuss to define “independent events” and we shall do so at once.

Two events A and B are said to be independent if we have P(AB) = 
P(A)P(B). Three events A, B and C are said to be independent if the relations 
in (2.4.4) hold. Thus independence is a notion relative to a given probability 
measure (by contrast, the notion of disjointness e.g. does not depend on any 
probability). More generally, the $n$ events $A_1, A_2, \ldots, A_n$ are independent if
the intersection [joint occurrence] of any subset of them has as its probability 
the product of probabilities of the individual events. If you find this sentence 
too long and involved, you may prefer the following symbolism. For any 
subset $(i_1, i_2, \ldots, i_k)$ of $(1, 2, \ldots, n)$, we have


(2.4.5)    $P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_k}).$


Of course here the indices $i_1, \ldots, i_k$ are distinct and $1 \le k \le n$.
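
The requirement that the product rule hold for every subset of the events can be spelled out mechanically. Here is a minimal sketch for a finite sample space; the function name `are_independent` and the two-coin illustration are assumptions introduced for this example only. It suggests why the definition insists on all subsets: in the illustration each pair of events passes the test while the triple does not.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def are_independent(events, prob):
    """Check the product rule for every subset of two or more events.

    `events` is a list of sets of sample points and `prob` maps a set of
    sample points to its probability. Subsets of size one satisfy the
    rule trivially and are skipped.
    """
    for k in range(2, len(events) + 1):
        for subset in combinations(events, k):
            joint = set.intersection(*subset)
            if prob(joint) != prod(prob(e) for e in subset):
                return False
    return True


# Two tosses of a fair coin: four equally likely outcomes.
omega = {"HH", "HT", "TH", "TT"}


def prob(event):
    return Fraction(len(event), len(omega))


first_head = {"HH", "HT"}
second_head = {"HH", "TH"}
same_face = {"HH", "TT"}

print(are_independent([first_head, second_head], prob))
print(are_independent([first_head, same_face], prob))
print(are_independent([second_head, same_face], prob))
print(are_independent([first_head, second_head, same_face], prob))
```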

Further elaboration of the notion of independence is postponed to §5.5, 
because it will be better explained in terms of random variables. But we shall 
describe briefly a classical scheme, the granddaddy of repeated trials and
subject of intensive and extensive research by J. Bernoulli, De Moivre,
Laplace, ..., Borel, ....


Example 8. (The coin-tossing scheme). A coin is tossed repeatedly n times. 
The joint outcome may be recorded as a sequence of H's and T's, where
H = “head,” T = “tail.” It is often convenient to quantify by putting H = 1,
T = 0; or H = 1, T = −1; we shall adopt the first usage here. Then the
result is a sequence of 0's and 1's consisting of $n$ terms such as 110010110
with $n = 9$. Since there are 2 outcomes for each trial, there are $2^n$ possible
joint outcomes. This is another application of the Fundamental Rule in §3.1.
If all of these are assumed to be equally likely so that each particular joint
outcome has probability $1/2^n$, then we can proceed as in Example 7 to verify
that the trials are independent and the coin is fair. You will find this a dull 
exercise but it is recommended that you go through it in your head if not on 
paper. However, we will turn the table around here by assuming at the outset 
that the successive tosses do form independent trials. On the other hand, we 
do not assume the coin to be “fair,” but only that the probabilities for head
(H) and tail (T) remain constant throughout the trials. Empirically speaking,
this is only approximately true since things do not really remain unchanged
over long periods of time. Now we need a precise notation to record complicated
statements, ordinary words being often awkward or ambiguous. Let
then $X_i$ denote the outcome of the $i$th trial and let $\epsilon_i$ denote 0 or 1 for each $i$,
but of course varying with the subscript. Then our hypothesis above may be
written as follows: 


(2.4.6)    $P(X_i = 1) = p, \quad P(X_i = 0) = 1 - p, \quad i = 1, 2, \ldots, n;$


where $p$ is the probability of head for each trial. For any particular, namely
completely specified, sequence $(\epsilon_1, \epsilon_2, \ldots, \epsilon_n)$ of 0's and 1's, the probability
of the corresponding sequence of outcomes is equal to


(2.4.7)    $P(X_1 = \epsilon_1, X_2 = \epsilon_2, \ldots, X_n = \epsilon_n) = P(X_1 = \epsilon_1) P(X_2 = \epsilon_2) \cdots P(X_n = \epsilon_n)$


as a consequence of independence. Now each factor on the right side above
is equal to $p$ or $1 - p$ according as the corresponding $\epsilon_i$ is 1 or 0. Suppose $j$
of these are 1's and $n - j$ are 0's; then the quantity in (2.4.7) is equal to


(2.4.8)    $p^j (1 - p)^{n-j}.$


Observe that for each sequence of trials, the number of heads is given by
the sum $\sum_{i=1}^{n} X_i$. It is important to understand that the number in (2.4.8) is
not the probability of obtaining $j$ heads in $n$ tosses, but rather that of obtaining
a specific sequence of heads and tails in which there are $j$ heads. In
order to compute the former probability, we must count the total number of
the latter sequences since all of them have the same probability given in
(2.4.8). This number is equal to the binomial coefficient $\binom{n}{j}$; see §3.2 for a
full discussion. Each one of these $\binom{n}{j}$ sequences corresponds to one possibility
of obtaining $j$ heads in $n$ trials, and these possibilities are disjoint.
Hence it follows from the additivity of $P$ that we have


$P\left(\sum_{i=1}^{n} X_i = j\right) = P(\text{exactly } j \text{ heads in } n \text{ trials})$

$= \binom{n}{j} P(\text{any specified sequence of } n \text{ trials with exactly } j \text{ heads})$

$= \binom{n}{j} p^j (1 - p)^{n-j}.$


This famous result is known as Bernoulli’s formula. We shall return to it many 
times in the book. 
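
The distinction between one specific sequence and the event "exactly j heads" can be seen by brute force. The sketch below is an illustration under assumed values (n = 5 tosses and p = 1/3, both chosen arbitrarily): it adds up the probability (2.4.8) over all sequences with j ones and compares the total with Bernoulli's formula.

```python
from fractions import Fraction
from itertools import product
from math import comb

n = 5                # number of tosses (assumed for the illustration)
p = Fraction(1, 3)   # assumed probability of head; the coin need not be fair


def sequence_prob(seq):
    """Probability (2.4.8) of one completely specified sequence of 0's and 1's."""
    j = sum(seq)
    return p**j * (1 - p) ** (len(seq) - j)


def prob_heads_by_enumeration(j):
    """Add the probabilities of all sequences of length n with exactly j ones."""
    return sum(
        (sequence_prob(s) for s in product((0, 1), repeat=n) if sum(s) == j),
        Fraction(0),
    )


def bernoulli(j):
    """Bernoulli's formula: C(n, j) p^j (1 - p)^(n - j)."""
    return comb(n, j) * p**j * (1 - p) ** (n - j)


for j in range(n + 1):
    print(j, prob_heads_by_enumeration(j), bernoulli(j))
```

The two columns printed for each j agree, and the values for j = 0, 1, ..., n add up to 1 because the events "exactly j heads" are disjoint and exhaust all possibilities.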


2.5.* Arithmetical density

* This section may be omitted.


We study in this section a very instructive example taken from arithmetic.


Example 9. Let $\Omega$ be the first 120 natural numbers {1, 2, ..., 120}. For the
probability measure $P$ we use the proportion as in Example 1 of §2.1. Now
consider the sets


$A = \{\omega \mid \omega \text{ is a multiple of } 3\},$

$B = \{\omega \mid \omega \text{ is a multiple of } 4\}.$


Then every third number of $\Omega$ belongs to A, and every fourth to B. Hence
we get the proportions:


P(A) = 1/3, P(B) = 1/4.


What does the set AB represent? It is the set of integers which are divisible
both by 3 and by 4. If you have not forgotten entirely your school arithmetic,
you know this is just the set of multiples of 3·4 = 12. Hence P(AB) = 1/12.
Now we can use (viii) to get P(A ∪ B):


(2.5.1) P(A ∪ B) = P(A) + P(B) − P(AB) = 1/3 + 1/4 − 1/12 = 1/2.


What does this mean? A ∪ B is the set of those integers in $\Omega$ which are
divisible by 3 or by 4 (or by both). We can count them one by one, but if
you are smart you see that you don't have to do this drudgery. All you have
to do is to count up to 12 (which is ten percent of the whole population $\Omega$),
and check them off as shown:


1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
      ✓  ✓     ✓     ✓  ✓          ✓
                                   ✓


There are 6 checked (one checked twice), hence the proportion of A ∪ B
among these 12 is equal to 6/12 = 1/2 as given by (2.5.1).

An observant reader will have noticed that in the case above we have also


P(AB) = 1/12 = 1/3 · 1/4 = P(A) · P(B).


This is true because the two numbers 3 and 4 happen to be relatively prime,
namely they have no common divisor except 1. Suppose we consider another
set:


$C = \{\omega \mid \omega \text{ is a multiple of } 6\}.$


Then P(C) = 1/6 but what is P(BC) now? The set BC consists of those
integers which are divisible by both 4 and 6, namely divisible by their least
common multiple (remember that?) which is 12 and not the product 4·6 = 24.
Thus P(BC) = 1/12. Furthermore, because 12 is the least common multiple
we can again stop counting at 12 in computing the proportion of the set
B ∪ C. An actual counting gives the answer 4/12 = 1/3, which may also be
obtained from the formula (2.3.7):


(2.5.2) P(B ∪ C) = P(B) + P(C) − P(BC) = 1/4 + 1/6 − 1/12 = 1/3.
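
The drudgery of counting through all 120 numbers is exactly what a machine is for. The following sketch recomputes the proportions of Example 9 directly; the helper names `prob` and `multiple_of` are introduced here for illustration only.

```python
from fractions import Fraction

omega = range(1, 121)  # the first 120 natural numbers


def prob(event):
    """Proportion of the numbers in omega for which the predicate holds."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def multiple_of(m):
    """Predicate for the set of multiples of m."""
    return lambda w: w % m == 0


A, B, C = multiple_of(3), multiple_of(4), multiple_of(6)

p_AB = prob(lambda w: A(w) and B(w))
p_A_or_B = prob(lambda w: A(w) or B(w))
p_BC = prob(lambda w: B(w) and C(w))
p_B_or_C = prob(lambda w: B(w) or C(w))

print(p_A_or_B, prob(A) + prob(B) - p_AB)   # both sides of (2.5.1)
print(p_B_or_C, prob(B) + prob(C) - p_BC)   # both sides of (2.5.2)
print(p_AB == prob(A) * prob(B))            # 3 and 4 are relatively prime
print(p_BC == prob(B) * prob(C))            # 4 and 6 are not
```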


This example illustrates a point which arose in the discussion in Example 3 
of §2.1. Instead of talking about the proportion of the multiples of 3, 
say, we can talk about its frequency. Here no rolling of any fortuitous dice 
is needed. God has given us those natural numbers (a great mathematician 
Kronecker said so), and the multiples of 3 occur at perfectly regular periods 
with the frequency 1/3. In fact, if we use $N_n(A)$ to denote the number of
natural numbers up to and including $n$ which belong to the set A, it is a
simple matter to show that


$\lim_{n \to \infty} \frac{N_n(A)}{n} = \frac{1}{3}.$


Let us call this P(A), the limiting frequency of A. Intuitively, it should represent
the chance of picking a number divisible by 3, if we can reach into the
whole bag of natural numbers as if they were so many indistinguishable balls 
in an urn. Of course similar limits exist for the sets B, C, AB, BC, etc. and 
have the values computed above. But now with this infinite sample space of
“all natural numbers,” call it $\Omega^*$, we can treat by the same method any set
of the form


(2.5.3)    $A_m = \{\omega \mid \omega \text{ is divisible by } m\}$


where m is an arbitrary natural number. Why then did we not use this more 
natural and comprehensive model? 

The answer may be a surprise for you. By our definition of probability 
measure given in §2.2, we should have required that every subset of $\Omega^*$ has
a probability, provided that $\Omega^*$ is countable which is the case here. Now take
for instance the set which consists of the single number {1971} or if you
prefer the set Z = {all numbers from 1 to 1971}. Its probability is given by
$\lim_{n \to \infty} N_n(Z)/n$ according to the same rule that was applied to the set A. But
$N_n(Z)$ is equal to 1971 for all values of $n \ge 1971$, hence the limit above is
equal to 0 and we conclude that every finite set has probability 0 by this rule.
If $P$ were to be countably additive as required by Axiom (ii*) in §2.3, then
$P(\Omega^*)$ would be 0 rather than 1. This contradiction shows that $P$ cannot be a
probability measure on $\Omega^*$. Yet it works perfectly well for sets such as $A_m$.
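
A numerical experiment cannot prove anything about a limit, but it makes the contrast vivid. The sketch below tabulates the relative frequency up to n for the multiples of 3 and for the finite set Z; the function name `count_up_to` and the sample values of n are assumptions chosen for the illustration.

```python
def count_up_to(n, belongs):
    """How many of the numbers 1, 2, ..., n belong to the set described by `belongs`."""
    return sum(1 for w in range(1, n + 1) if belongs(w))


def in_A(w):
    return w % 3 == 0  # multiples of 3


def in_Z(w):
    return w <= 1971  # the finite set Z = {1, ..., 1971}


for n in (10**3, 10**4, 10**5):
    freq_A = count_up_to(n, in_A) / n
    freq_Z = count_up_to(n, in_Z) / n
    print(n, freq_A, freq_Z)
```

The frequency of the multiples of 3 settles near 1/3, while that of Z keeps shrinking once n passes 1971, in line with the argument above.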

There is a way out of this paradoxical situation. We must abandon our 
previous requirement that the measure be defined for all subsets (of natural 
numbers). Let a finite number of the sets $A_m$ be given, and let us consider
the composite sets which can be obtained from these by the operations: 
complementation, union and intersection. Call this class of sets the class 
generated by the original sets. Then it is indeed possible to define P in the 
manner prescribed above for all sets in this class. A set which is not in the 
class has no probability at all. For example, the set Z does not belong to 
the class generated by A, B, C. Hence its probability is not defined, rather 
than zero. We may also say that the set Z is nonmeasurable in the context of 
Example 2 of §2.1. This saves the situation but we will not pursue it further 
here except to give another example. 


Example 10. What is the probability of the set of numbers divisible by 3, not 
divisible by 5, and divisible by either 4 or 6? 

Using the preceding notation, the set in question is $AD^c(B \cup C)$, where
$D = A_5$. Using the distributive law, we can write this as $AD^cB \cup AD^cC$. We
have also


$(AD^cB)(AD^cC) = AD^cBC = ABC - ABCD.$

Hence by (v),


$P(AD^cBC) = P(ABC) - P(ABCD) = \frac{1}{12} - \frac{1}{60} = \frac{1}{15}.$


Similarly, we have


$P(AD^cB) = P(AB) - P(ABD) = \frac{1}{12} - \frac{1}{60} = \frac{4}{60} = \frac{1}{15},$

$P(AD^cC) = P(AC) - P(ACD) = \frac{1}{6} - \frac{1}{30} = \frac{4}{30} = \frac{2}{15}.$


Finally we obtain by (viii):


$P(AD^cB \cup AD^cC) = P(AD^cB) + P(AD^cC) - P(AD^cBC) = \frac{1}{15} + \frac{2}{15} - \frac{1}{15} = \frac{2}{15}.$

You should check this using the space $\Omega$ in Example 9.

The problem can be simplified by a little initial arithmetic, because the set
in question is seen to be that of numbers divisible by 2 and 3 (that is, by 6)
and not by 5. Now our method will yield the answer more quickly.
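
The suggested check on the space of the first 120 natural numbers takes only a few lines. This is a sketch; the names `in_set`, `favorable` and `simplified` are introduced here for illustration. It also compares the set with the numbers divisible by 6 but not by 5.

```python
from fractions import Fraction

omega = range(1, 121)  # the sample space of Example 9


def in_set(w):
    """Divisible by 3, not divisible by 5, and divisible by either 4 or 6."""
    return w % 3 == 0 and w % 5 != 0 and (w % 4 == 0 or w % 6 == 0)


favorable = [w for w in omega if in_set(w)]
answer = Fraction(len(favorable), len(omega))

# The same set described more simply: divisible by 6 and not by 5.
simplified = [w for w in omega if w % 6 == 0 and w % 5 != 0]

print(len(favorable), answer)
print(favorable == simplified)
```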


Exercises 


1. Consider Example 1 in Section 2.1. Suppose that each good apple costs
1¢ while a rotten one costs nothing. Denote the rotten ones by R, an
arbitrary bunch from the bushel by S, and define


$Q(S) = |S \setminus R| / |\Omega - R|.$


Q is the relative value of S with respect to that of the bushel. Show
that it is a probability measure.

2. Suppose that the land of a square kingdom is divided into three strips
A, B, C of equal area and suppose the value per unit is in the ratio of
1:3:2. For any piece of (measurable) land S in this kingdom the relative
value with respect to that of the kingdom is then given by the formula:


$V(S) = \frac{P(SA) + 3P(SB) + 2P(SC)}{2}$


where P is as in Example 2 of §2.1. Show that V is a probability measure.

3.* Generalizing No. 2, let $a_1, \ldots, a_n$ be arbitrary positive numbers and
let $A_1 + \cdots + A_n = \Omega$ be an arbitrary partition. Let $P$ be a probability
measure on $\Omega$ and


$Q(S) = [a_1 P(SA_1) + \cdots + a_n P(SA_n)] / [a_1 P(A_1) + \cdots + a_n P(A_n)]$


for any subset S of $\Omega$. Show that Q is a probability measure.

4. Suppose the first cup of coffee C_1 costs 15¢, and a second cup C_2 costs
10¢. Using P to denote “price,” write down a formula like Axiom (ii)
but with an inequality (P is “subadditive”).

5. Suppose that on a shirt sale each customer can buy two shirts at $4
each, but the regular price is $5. A customer bought 4 shirts S_1, ..., S_4.
Write down a formula like Axiom (ii) and contrast with Example 3.
Forget about sales tax! (P is “superadditive.”)

6. Show that if P and Q are two probability measures defined on the same
(countable) sample space, then aP + bQ is also a probability measure
for any two nonnegative numbers a and b satisfying a + b = 1. Give a
concrete illustration of such a mixture.

7.* If P is a probability measure, show that the function P/2 satisfies
Axioms (i) and (ii) but not (iii). The function P² satisfies (i) and (iii) but
not necessarily (ii); give a counterexample to (ii) by using Example 1.

8.* If A, B, C are arbitrary sets, show that


(a) P(A ∩ B ∩ C) ≤ P(A) ∧ P(B) ∧ P(C);
(b) P(A ∪ B ∪ C) ≥ P(A) ∨ P(B) ∨ P(C).


9. Prove that for any two sets A and B, we have

P(AB) ≥ P(A) + P(B) − 1.


Give a concrete example of this inequality. [Hint: Use (2.3.4) with n = 2
and De Morgan's laws.]

10. We have A ∩ A = A but when is P(A)·P(A) = P(A)? Can P(A) = 0
but A ≠ ∅?

11. Find an example where P(AB) < P(A)P(B).

12. Prove (2.3.7) by first showing that


(A ∪ B) − A = B − (A ∩ B).


13. Two groups share some members. Suppose that Group A has 123,
Group B has 78 members, and the total membership in both groups is
184. How many members belong to both?

14. Groups A, B, C have respectively 57, 49, 43 members. A and B have 13,
A and C have 7, B and C have 4 members in common; and there is a
lone guy who belongs to all three groups. Find the total number of
people in all three groups.

15.* Generalize Exercise 14 when the various numbers are arbitrary, but of
course subject to certain obvious inequalities. The resulting formula,
divided by the total population (there may be many non-joiners!) is the
extension of (2.3.7) to n = 3.

16. Compute P(A △ B) in terms of P(A), P(B) and P(AB); also in terms of
P(A), P(B) and P(A ∪ B).

17.* Using the notation (2.5.3) and the probability defined in that context,
show that for any two m and n we have


$P(A_m A_n) \ge P(A_m) P(A_n).$


When is there equality above?

18.* Recall the computation of plane areas by double integration in calculus;
for a nice figure such as a parallelogram, trapezoid or circle we have


$\text{Area of } S = \iint_S 1 \, dx \, dy.$


Show that this can be written in terms of the indicator $I_S$ as


$A(S) = \iint_{\Omega} I_S(x, y) \, dx \, dy,$


where $\Omega$ is the whole plane and $I_S(x, y)$ is the value of the function $I_S$
for (at) the point (x, y) (denoted by $\omega$ in §1.4). Show also that for two
such figures $S_1$ and $S_2$, we have


$A(S_1) + A(S_2) = \iint (I_{S_1} + I_{S_2}),$


where we have omitted some unnecessary symbols.

19.* Now you can demonstrate the connection between (2.3.7) and (2.3.8)
mentioned there, in the case of plane areas.

20. Find several examples of $\{p_n\}$ satisfying the conditions in (2.4.1); give at
least two in which all $p_n > 0$.

21.* Deduce from Axiom (ii*) the following two results. (a) If the sets $A_n$
are nondecreasing, namely $A_n \subset A_{n+1}$ for all $n \ge 1$, and $A_\infty = \bigcup_n A_n$,
then $P(A_\infty) = \lim_{n \to \infty} P(A_n)$. (b) If the sets $A_n$ are nonincreasing, namely
$A_n \supset A_{n+1}$ for all $n \ge 1$, and $A_\infty = \bigcap_n A_n$, then $P(A_\infty) = \lim_{n \to \infty} P(A_n)$.
[Hint: For (a), consider $A_1 + (A_2 - A_1) + (A_3 - A_2) + \cdots$; for (b),
dualize by complementation.]

22. What is the probability (in the sense of Example 4) that a natural
number picked at random is not divisible by any of the numbers 3, 4, 6
but is divisible by 2 or 5?

23.* Show that if $(m_1, \ldots, m_n)$ are co-prime positive integers, then the events
$(A_{m_1}, \ldots, A_{m_n})$ defined in §2.5 are independent.

24. What can you say about the event A if it is independent of itself? If the
events A and B are disjoint and independent, what can you say of them?

25. Show that if the two events (A, B) are independent, then so are $(A, B^c)$,
$(A^c, B)$ and $(A^c, B^c)$. Generalize this result to three independent events.

26. Show that if A, B, C are independent events, then A and $B \cup C$ are
independent, also $A \setminus B$ and C are independent.

27. Prove that


$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC)$


when A, B, C are independent by considering $P(A^c B^c C^c)$. [The formula
remains true without the assumption of independence; see §6.2.]

28. Suppose 5 coins are tossed; the outcomes are independent but the probability
of head may be different for different coins. Write the probability
of the specific sequence HHTHT, and the probability of exactly 3 heads.

29.* How would you build a mathematical model for arbitrary repeated
trials, namely without the constraint of independence? In other words,
describe a sample space suitable for recording such trials. What is the
mathematical definition of an event which is determined by one of the
trials alone, two of them, etc.? You do not need a probability measure.
Now think how you would cleverly construct such a measure over the
space in order to make the trials independent. The answer is given in
e.g. [Feller 1, §V.4], but you will understand it better if you first give
it a try yourself.


Chapter 3 


Counting 


3.1. Fundamental rule 


The calculation of probabilities often leads to the counting of various 
possible cases. This has been indicated in Examples 4 and 5 of §2.2 and 
forms the backbone of the classical theory with its stock in trade the games 
of chance. But combinatorial techniques are also needed in all kinds of 
applications arising from sampling, ranking, partitioning, allocating, programming
and model building, to mention a few. In this chapter we shall
treat the most elementary and basic types of problems and the methods of 
solving them. 

The author has sometimes begun a discussion of “permutations and
combinations” by asking in class the following question. If a man has three
shirts and two ties, in how many ways can he dress up [put on one of each]?
There are only two numbers 2 and 3 involved, and it's anybody's guess that
one must combine them in some way. Does one add: 2 + 3? or multiply:
2 × 3? (or perhaps make $2^3$ or $3^2$). The question was meant to be rhetorical
but experience revealed an alarming number of wrong answers. So if we 
dwell on this a little longer than you deem necessary you will know why. 

First of all, in a simple example like that, one can simply picture to oneself 
the various possibilities and count them up mentally: 


Figure 15

A commonly used tabulation is as follows:


(3.1.1) 


As mathematics is economy of thought we can schematize (program) this in 
a more concise way: 


$(s_1, t_1)$ $(s_1, t_2)$ $(s_2, t_1)$ $(s_2, t_2)$ $(s_3, t_1)$ $(s_3, t_2)$


and finally we see that it is enough just to write

(3.1.2)    (1, 1) (1, 2) (2, 1) (2, 2) (3, 1) (3, 2)


by assigning the first slot to “shirt” and the second to “tie.” Thus we have
reached the mathematical method of naming the collection in (3.1.2). It is
the set of all ordered couples (a, b) such that a = 1, 2, 3; b = 1, 2; and you
see that the answer to my question is 3 × 2 = 6.

In general we can talk about ordered k-tuples $(a_1, \ldots, a_k)$ where for each
$j$ from 1 to $k$, the symbol $a_j$ indicates the assignment (choice) for the $j$th slot,
and it may be denoted by a numeral between 1 and $m_j$. In the example above
$k = 2$, $m_1 = 3$, $m_2 = 2$, and the collection of all $(a_1, a_2)$ is what is enumerated
in (3.1.2).

This symbolic way of doing things is extremely convenient. For instance, 
if the man has also two pairs of shoes, we simply extend each 2-tuple to a 
3-tuple by adding a third slot into which we can put either “1” or “2”. Thus 
each of the original 2-tuples in (3.1.2) splits into two 3-tuples, and so the 
total of 3-tuples will be 3 × 2 × 2 = 12. This is the number of ways the man
can choose a shirt, a tie and a pair of shoes. You see it is all automated as on 
a computing machine. As a matter of fact, it is mathematical symbolism that 
taught the machines, not the other way around (at least, not yet). 

The idea of splitting mentioned above lends itself well to visual imagination.
It shows why 3 “shirts” multiply into 6 “shirt-ties” and 12 “shirt-tie-shoes.”
Take a good look at it. Here is the general proposition: 


Fundamental Rule. A number of multiple choices are to be made. There are 
$m_1$ possibilities for the first choice, $m_2$ for the second, $m_3$ for the third, etc.
If these choices can be combined freely, then the total number of possibilities
for the whole set of choices is equal to


$m_1 \times m_2 \times m_3 \times \cdots$

A formal proof would amount to repeating what is described above in
more cut-and-dried terms, and is left to your own discretion. Let me point out
however that “free combination” means in the example above that no
matching of shirt and ties is required, etc.
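
The Fundamental Rule can be watched at work by listing the tuples instead of merely counting them. In the sketch below the labels for shirts, ties and shoes and the helper `count_choices` are illustrative assumptions; `itertools.product` forms all freely combined choices.

```python
from itertools import product
from math import prod

shirts = ["s1", "s2", "s3"]
ties = ["t1", "t2"]
shoes = ["shoes1", "shoes2"]

# Every shirt combined freely with every tie, then with every pair of shoes.
shirt_ties = list(product(shirts, ties))
shirt_tie_shoes = list(product(shirts, ties, shoes))


def count_choices(*sizes):
    """Fundamental Rule: multiply the numbers of possibilities for each choice."""
    return prod(sizes)


print(len(shirt_ties), count_choices(3, 2))
print(len(shirt_tie_shoes), count_choices(3, 2, 2))
```

Each 2-tuple in the first list appears twice in the second, once with each pair of shoes, which is the splitting described above.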


Example 1. A menu in a restaurant reads like this: 


Choice of one: 
Soup, Juice, Fruit Cocktail 


Choice of one: 
Beef Hash 
Roast Ham 
Fried Chicken 
Spaghetti with Meat Balls 


Choice of one: 
Mashed Potatoes, Broccoli, Lima Beans 


Choice of one: 
Ice Cream, Apple Pie 


Choice of one: 
Coffee, Tea, Milk 


Suppose you take one of each “course” without substituting or skipping, how
many options do you have? Or if you like the language nowadays employed
in more momentous decisions of this sort, how many scenarios of a “complete
5-course dinner” (as advertised) can you make out of this menu? The total
number of items you see on the menu is


3 + 4 + 3 + 2 + 3 = 15.


But you don't eat them all. On the other hand, the number of different
dinners available is equal to


3 × 4 × 3 × 2 × 3 = 216,


according to the Fundamental Rule. True, you eat only one dinner at a time,
but it is quite possible for you to try all these 216 dinners if you have catholic
taste in food and patronize that restaurant often enough. More realistically 
and statistically significant: all these 216 dinners may be actually served to 
different customers over a period of time and perhaps even on a single day. 
This possibility forms the empirical basis of combinatorial counting and its 
relevance to computing probabilities. 


Example 2. We can now solve the problem about Examples (e) and (e′) in
Section 1.1: in how many ways can six dice appear when they are rolled?
and in how many ways can they show all different faces?

Each die here represents a multiple choice of six possibilities. For the
first problem these 6 choices can be freely combined so the rule applies
directly to give the answer $6^6 = 46656$. For the second problem the choices
cannot be freely combined since they are required to be all different. Offhand 
the rule does not apply, but the reasoning behind it does. This is what counts 
in mathematics: not a blind reliance on a rule but a true understanding of its 
meaning. (Perhaps that is why “permutation and combination” is for many 
students harder stuff than algebra or calculus.) Look at the following splitting 
diagram (see Fig. 16). 

The first die can show any face, but the second must show a different one. 
Hence, after the first choice has been made, there are 5 possibilities for the 
second choice. Which five depends on the first choice but their number does 
not. So there are 6 × 5 possibilities for the first and second choices together.
After these have been made, there are 4 possibilities left for the third, and so
on. For the complete sequence of six choices we have therefore 6·5·4·3·2·1 =
6! = 720 possibilities. By the way, make sure by analyzing the diagram that
the first die hasn't got a preferential treatment. Besides, which is “first”?

Of course, we can re-enunciate a more general rule to cover the situation 
just discussed, but is it necessary once the principles are understood? 
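
Both counts in Example 2 are small enough to verify by listing every outcome. The following sketch does so; the variable names are chosen here for illustration. It treats the six dice as an ordered 6-tuple of faces.

```python
from itertools import product
from math import factorial

faces = range(1, 7)

# All ways six dice can appear: ordered 6-tuples of faces, freely combined.
rolls = list(product(faces, repeat=6))

# Those in which the six dice show all different faces.
all_different = [r for r in rolls if len(set(r)) == 6]

print(len(rolls), 6**6)
print(len(all_different), factorial(6))
```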


3.2. Diverse ways of sampling 


Let us proceed to several standard methods of counting which constitute the 
essential elements in the majority of combinatorial problems. These can be 
conveniently studied either as sampling or as allocating problems. We begin 
with the former. 

An urn contains m distinguishable balls marked 1 to m, from which n 
balls will be drawn under various specified conditions, and the number of all 
possible outcomes will be counted in each case. 


I. Sampling with replacement and with ordering. 

We draw n balls sequentially, each ball drawn being put back into the 
urn before the next drawing is made, and we record the numbers on the balls 
together with their order of appearance. Thus we are dealing with ordered 
n-tuples $(a_1, \ldots, a_n)$ in which each $a_j$ can be any number from 1 to $m$. The
Fundamental Rule applies directly and yields the answer $m^n$. This corresponds
to the case of rolling six dice without restriction, but the analogy may be 
clearer if we think of the same dice being rolled six times in succession, so 
that each rolling corresponds to a drawing. 


II. Sampling without replacement and with ordering. 

We sample as in Case I but after each ball is drawn it is left out of the urn. 
We are dealing with ordered n-tuples $(a_1, \ldots, a_n)$ as above with the restriction
that they be all different. Clearly we must have $n \le m$. The Fundamental
Rule does not apply directly but the splitting argument works as in Example
2 and yields the answer

(3.2.1)    $m \cdot (m - 1) \cdot (m - 2) \cdots (m - n + 1) = (m)_n.$


Observe that there are $n$ factors on the left side of (3.2.1) and that the last
factor is $m - (n - 1)$ rather than $m - n$, why? We have introduced the
symbol $(m)_n$ to denote this “continued product” on the left side of (3.2.1).
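
Cases I and II can be compared side by side on a small urn. In the sketch below the values m = 5 and n = 3 and the function name `falling_factorial` are assumptions made for the illustration; `itertools.product` lists the ordered samples with replacement and `itertools.permutations` those without.

```python
from itertools import permutations, product


def falling_factorial(m, n):
    """The continued product m (m - 1) ... (m - n + 1) with n factors."""
    result = 1
    for i in range(n):
        result *= m - i  # the last factor is m - (n - 1)
    return result


m, n = 5, 3
balls = range(1, m + 1)

with_replacement = list(product(balls, repeat=n))      # Case I
without_replacement = list(permutations(balls, n))     # Case II

print(len(with_replacement), m**n)
print(len(without_replacement), falling_factorial(m, n))
```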

Case II has a very important subcase which can be posed as a “permutation”
problem.


IIa. Permutation of m distinguishable balls.

This is the case of II when m = n. Thus all m balls are drawn out one
after another without being put back. The result is therefore just the m
numbered balls appearing in some order, and the total number of such
possibilities is the same as that of all possible arrangements (ordering,
ranking, permutation) of the set {1, 2, ..., m}. This number is called the
factorial of m and denoted by


$m! = (m)_m = m(m - 1) \cdots 2 \cdot 1.$


 n    n!
 1    1
 2    2
 3    6
 4    24
 5    120
 6    720
 7    5040
 8    40320
 9    362880
10    3628800


II. Sampling without replacement and without ordering. 

Here the balls are not put back and their order of appearance is not 
recorded; hence we might as well draw all n balls at one grab. We are dealing 
therefore with subsets of size n from a set (population) of size m. To count 
their number, we will compare with Case II where the balls are ranged in 
order. Now a bunch of n balls, if drawn one by one, can appear in n! different 
ways by Case Ila. Thus each unordered sample of size n produces n! ordered 
ones, and conversely every ordered sample of size n can be produced in this 
manner. For instance if m = 5, n = 3, the subset {3, 5, 2} can be drawn in 
3! = 6 ways as follows: 


        (2, 3, 5)  (2, 5, 3)  (3, 2, 5)  (3, 5, 2)  (5, 2, 3)  (5, 3, 2). 


In general we know from Case II that the total number of ordered samples 
of size n is $(m)_n$. Let us denote for one moment the unknown number of 
unordered samples of size n by x; then the argument above shows that 


$$n!\,x = (m)_n.$$


Solving for x, we get the desired answer which will be denoted by 


(3.2.2)    $$\binom{m}{n} = \frac{(m)_n}{n!}.$$

If we multiply both numerator and denominator by (m − n)!, we see from 
(3.2.1) that 

(3.2.3)    $$\binom{m}{n} = \frac{(m)_n\,(m-n)!}{n!\,(m-n)!} = \frac{m(m-1)\cdots(m-n+1)(m-n)\cdots 2 \cdot 1}{n!\,(m-n)!} = \frac{m!}{n!\,(m-n)!}.$$
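
The comparison between Cases II and III can be replayed by machine. The sketch below, with the assumed values m = 5 and n = 3 taken from the illustration in the text, groups the ordered samples by the subset they come from and checks that every subset is hit exactly n! times.

```python
from itertools import combinations, permutations
from math import comb, factorial

m, n = 5, 3
balls = range(1, m + 1)

# Group the ordered samples (Case II) by the subset they come from.
orderings_per_subset = {}
for sample in permutations(balls, n):
    key = frozenset(sample)
    orderings_per_subset[key] = orderings_per_subset.get(key, 0) + 1

x = len(orderings_per_subset)             # number of unordered samples
assert all(c == factorial(n) for c in orderings_per_subset.values())
assert factorial(n) * x == len(list(permutations(balls, n)))   # n! x = (m)_n
assert x == comb(m, n) == len(list(combinations(balls, n)))    # (3.2.2)
print(x)  # 10
```

Here `math.comb` (Python 3.8 or later) plays the role of the closed form (3.2.3).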


When n = m, there is exactly one subset of size n, namely the whole set, 
hence the number in (3.2.3) must reduce to 1 if it is to maintain its significance. 
So we are obliged to set 0! = 1. Under this convention, formula (3.2.3) holds 
for $0 \le n \le m$. The number $\binom{m}{n}$ is called a binomial coefficient and plays an 
important role in probability theory. Note that 

(3.2.4)    $$\binom{m}{n} = \binom{m}{m-n},$$

which is immediate from (3.2.3). It is also obvious without this explicit 
evaluation from the interpretation of both sides as counting formulas (why?). 
The argument used in Case III leads to a generalization of IIa: 


IIIa. Permutation of m balls which are distinguishable by groups. 

Suppose that there are $m_1$ balls of color no. 1, $m_2$ balls of color no. 2, ..., 
$m_r$ balls of color no. r. Their colors are distinguishable but balls of the same 
color are not. Of course $m_1 + m_2 + \cdots + m_r = m$. How many distinguishable 
arrangements of these m balls are there? 

For instance, if $m_1 = m_2 = 2$, m = 4 and the colors are black and white, 
there are 6 distinguishable arrangements as follows: 


        ●●○○    ●○●○    ●○○●    ○●●○    ○●○●    ○○●●


To answer the question in general, we compare with Case IIa where all 
balls are distinguishable. Suppose we mark the balls of color no. 1 from 1 to 
$m_1$, the balls of color no. 2 from 1 to $m_2$, and so forth. Then they become all 
distinguishable and so the total number of arrangements after the markings 
will be m! by Case IIa. Now the $m_1$ balls of color no. 1 can be arranged in $m_1!$ 
ways by their new marks, the $m_2$ balls of color no. 2 can be arranged in $m_2!$ 
ways by their new marks, etc. Each arrangement for one color can be freely 
combined with any arrangement for another color. Hence according to the 
Fundamental Rule, there are altogether 


$$m_1!\, m_2! \cdots m_r!$$


new arrangements produced by the various markings, for each original 
unmarked arrangement. It follows as in the discussion of Case III that the 
total number of distinguishable unmarked arrangements is equal to the 
quotient 


$$\frac{m!}{m_1!\, m_2! \cdots m_r!}.$$

This is called a multinomial coefficient. When r = 2 it reduces to the binomial 
coefficient $\binom{m}{m_1} = \binom{m}{m_2}$. 
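
The marking argument can be checked on the black and white example. In this sketch the helper name `multinomial` and the letters "B" and "W" are assumptions made for illustration. The 4! marked arrangements are collapsed with a set, which leaves the distinguishable unmarked ones.

```python
from itertools import permutations
from math import factorial, prod

def multinomial(*group_sizes: int) -> int:
    """m! / (m_1! m_2! ... m_r!) with m = sum of the group sizes."""
    m = sum(group_sizes)
    return factorial(m) // prod(factorial(s) for s in group_sizes)

# Two black (B) and two white (W) balls: collapse the 4! marked
# arrangements into distinguishable unmarked ones.
arrangements = sorted(set(permutations("BBWW")))
print(len(arrangements), multinomial(2, 2))   # 6 6
print(["".join(a) for a in arrangements])
```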


IV. Sampling with replacement and without ordering. 

We draw n balls one after another, each ball being put back into the urn 
before the next drawing is made, but we record the numbers drawn with 
possible repetitions as a lot without paying attention to their order of ap- 
pearance. This is a slightly more tricky situation so we will begin by a numeri- 
cal illustration. Take m = n = 3; all the possibilities in this case are listed in 
the first column below: 


            111      ✓✓✓ |     |          ✓✓✓||
            112      ✓✓  | ✓   |          ✓✓|✓|
            113      ✓✓  |     | ✓        ✓✓||✓
            122      ✓   | ✓✓  |          ✓|✓✓|
            123      ✓   | ✓   | ✓        ✓|✓|✓
(3.2.5)     133      ✓   |     | ✓✓       ✓||✓✓
            222          | ✓✓✓ |          |✓✓✓|
            223          | ✓✓  | ✓        |✓✓|✓
            233          | ✓   | ✓✓       |✓|✓✓
            333          |     | ✓✓✓      ||✓✓✓


Do you see the organization principle used in making the list? 
In general think of a “tally sheet” with numbers indicating the balls in 
the top line: 


        1  |  2  |  3  |  4  |  . . .  |  m


After each drawing we place a check under the number (of the ball) which is 
drawn. Thus at the end of all the drawings the total number of checks on the 
sheet will be n (which may be greater than m); there may be as many as that 
in an entry, and there may be blanks in some entries. Now economize by 
removing all dispensable parts of the tally sheet, so that column 1 in (3.2.5) 
becomes the skeleton in column 2. Check this over carefully to see that 
no information is lost in the simplified method of accounting. Finally align 
the symbols ✓ (check) and | (bar) in column 2 to get column 3. Now forget about 
columns 1 and 2 and concentrate on column 3 for a while. Do you see how to 
reconstruct from each little cryptogram of “checks and bars” the original tally? 
Do you see that all possible ways of arranging 3 checks and 2 bars are listed 
there? Thus the total number is (by Case IIIa with m = 5, $m_1 = 3$, $m_2 = 2$, 
or equivalently Case III with m = 5, n = 3) equal to 5!/(3! 2!) = 10 as shown. 
This must therefore also be the number of all possible tally results. 

In general each possible record of sampling under Case IV can be transformed 
by the same method into the problem of arranging n checks and 
m − 1 bars (since m slots have m − 1 lines dividing them) in all possible 
ways. You will have to draw some mental pictures to convince yourself that 
there is one-to-one correspondence between the two problems as in the 
particular case illustrated above. From IIIa we know the solution to the 
second problem is 


(3.2.6)    $$\binom{m+n-1}{n} = \binom{m+n-1}{m-1}.$$


Hence this is also the total number of outcomes when we sample under 
Case IV. 
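
The one-to-one correspondence with “checks and bars” can also be spelled out in code. The sketch below is only an illustration: the function name `checks_and_bars` and the letter "v" standing for a check are assumptions. It encodes every unordered sample for m = n = 3 and confirms that the codes are all different and that their number agrees with (3.2.6).

```python
from itertools import combinations_with_replacement
from math import comb

def checks_and_bars(sample, m):
    """Encode an unordered sample (with repetitions) from balls 1..m as a
    string of checks 'v' and m - 1 bars '|' separating the m slots."""
    return "|".join("v" * sample.count(ball) for ball in range(1, m + 1))

m, n = 3, 3
samples = list(combinations_with_replacement(range(1, m + 1), n))
codes = [checks_and_bars(s, m) for s in samples]
for s, code in zip(samples, codes):
    print("".join(map(str, s)), code)     # e.g. 112 vv|v|

assert len(set(codes)) == len(samples) == comb(m + n - 1, n)   # 10
```

`itertools.combinations_with_replacement` happens to list the samples in the same order 111, 112, 113, 122, ... as the first column of the numerical illustration.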


Example 3. D’Alembert’s way of counting discussed in Example 4 of Section 
2.1 is equivalent to sampling under Case IV with m = n = 2. Tossing each 
coin corresponds to drawing a head or a tail; if the results for two coins are 
tallied without regard to “which shows which” (or without ordering when 
the coins are tossed one after the other), then there are three possible outcomes: 

        ✓✓|         ✓|✓         |✓✓

        HH        HT = TH        TT


Similarly, if six dice are rolled and the dice are not distinguishable, then the 
total number of recognizably distinct patterns is given by (3.2.6) with 
m = n = 6, namely 

$$\binom{6+6-1}{6} = \binom{11}{6} = 462.$$


This is less than 1% of the number 46656 under Case I. 
We will now illustrate in a simple numerical case the different ways of 
counting in the four sampling procedures: m = 4,n = 2. 
Case (I) 

        (1,1) (1,2) (1,3) (1,4) 
        (2,1) (2,2) (2,3) (2,4) 
        (3,1) (3,2) (3,3) (3,4) 
        (4,1) (4,2) (4,3) (4,4) 


Case (II)                               Case (IV) 

              (1,2) (1,3) (1,4)         (1,1) (1,2) (1,3) (1,4) 
        (2,1)       (2,3) (2,4)               (2,2) (2,3) (2,4) 
        (3,1) (3,2)       (3,4)                     (3,3) (3,4) 
        (4,1) (4,2) (4,3)                                 (4,4) 


Case (III) 

        (1,2) (1,3) (1,4) 
              (2,3) (2,4) 
                    (3,4) 
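
The four lists above can be regenerated with the standard `itertools` functions, one per sampling procedure. This is a sketch for the same values m = 4, n = 2; the dictionary labels are descriptive assumptions.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

m, n = 4, 2
balls = range(1, m + 1)

cases = {
    "I   (replacement, ordered)":      list(product(balls, repeat=n)),
    "II  (no replacement, ordered)":   list(permutations(balls, n)),
    "III (no replacement, unordered)": list(combinations(balls, n)),
    "IV  (replacement, unordered)":    list(combinations_with_replacement(balls, n)),
}
for name, outcomes in cases.items():
    print(f"Case {name}: {len(outcomes)}")   # 16, 12, 6, 10
```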


3.3. Allocation models; binomial coefficients 


A source of inspiration as well as frustration in combinatorics is that the 
same problem may appear in different guises and it may take an effort to 
recognize their true identity. Sampling under Case (IV) is a case in point; 
another will be discussed under (IIIb) below. People have different thinking 
habits and often prefer one way to another. But it may be worthwhile to 
learn some other ways as we learn foreign languages. In the above we have 
treated several basic counting methods as sampling problems. Another 
formulation, preferred by physicists and engineers, is “putting balls into 
boxes.” Since these balls play a different role from those used above, we will 
call them tokens instead to simplify the translation later. 

There are m boxes labeled from 1 to m and there are n tokens which are 
numbered from 1 to n. The tokens are put into the boxes with or without the 
condition that no box can contain more than one token. We record the out- 
come of the allocation (occupancy) by noting the number of tokens in each 
box, with or without noting the labels on the tokens. The four cases below 
then correspond respectively with the four cases of sampling discussed above. 

I’. Each box may contain any number of tokens and the labels on the 
tokens are observed. 

II’. No box may contain more than one token and the labels on the tokens 
are observed. 

III’. No box may contain more than one token and the labels on the tokens 
are not observed. 

IV’. Each box may contain any number of tokens and the labels on the 
tokens are not observed. 


It will serve no useful purpose to prove that the corresponding problems 
are really identical, for this is the kind of mental exercise one must go 
through by oneself to be convinced. (Some teachers even go so far as to say 
that combinatorial thinking cannot be taught.) However, here are the key 
words in the translation from one to the other description. 


        Sampling                        Allocating 
        Ball                            Box 
        Number of drawing               Number on token 
        jth drawing gets ball no. k     jth token put into box no. k 


In some way the new formulation is more adaptable in that further con- 
ditions on the allocation can be imposed easily. For instance, one may require 
that no box be left empty when n > m, or specify “loads” in some boxes. Of 
course these conditions can be translated into the other language, but they 
may then become less natural. Here is one important case of this sort which 
is just Case IIIa in another guise. 


IIIb. Partition into numbered groups. 

Let a population of m objects be subdivided into r subpopulations or just 
“groups”: $m_1$ into group no. 1, $m_2$ into group no. 2, ..., $m_r$ into group no. r; 
where $m_1 + \cdots + m_r = m$ and all $m_j \ge 1$. This is a trivial paraphrasing of 
putting m tokens into r boxes so that $m_j$ tokens are put into box no. j. It is 
important to observe that it is not the same as subdividing into r groups of 
sizes $m_1, \ldots, m_r$; for the groups are numbered. A simple example will make 
this clear. 


Example 4. In how many ways can 4 people split into two pairs? 
The English language certainly does not make this question unambiguous, 
but offhand one would have to consider the following 3 ways as the answer: 


(3.3.1)        (12)(34)    (13)(24)    (14)(23). 


This is the correct interpretation if the two pairs are going to play chess or 
pingpong games and two equally good tables are available to both pairs. But 
now suppose the two pairs are going to play doubles tennis together and the 
“first” pair has the choice of side of the court, or will be the first to serve. It 
will then make a difference whether (12) precedes (34) or vice versa. So each 
case in (3.3.1) must be permuted by pairs into two orders and the answer is 
then the following 6 ways: 


(12)(34) (34)(12) (13)(24) (24)(13) (14)(23) (23)(14). 
This is the situation covered by the general problem of partition under IIIb 
which will now be solved. 

Think of putting tokens (people) into boxes (groups). According to sampling 
Case III, there are $\binom{m}{m_1}$ ways of choosing $m_1$ tokens to be put into 
box 1; after that, there are $\binom{m-m_1}{m_2}$ ways of choosing $m_2$ tokens from the 
remaining $m - m_1$ to be put into box 2; and so on. The Fundamental Rule 
does not apply but its modification used in sampling Case II does, and so 
the answer is 

(3.3.2)
$$\begin{aligned}
&\binom{m}{m_1}\binom{m-m_1}{m_2}\binom{m-m_1-m_2}{m_3}\cdots\binom{m-m_1-\cdots-m_{r-1}}{m_r} \\
&\quad= \frac{m!}{m_1!\,(m-m_1)!}\cdot\frac{(m-m_1)!}{m_2!\,(m-m_1-m_2)!}\cdot\frac{(m-m_1-m_2)!}{m_3!\,(m-m_1-m_2-m_3)!}\cdots\frac{(m-m_1-\cdots-m_{r-1})!}{m_r!\,0!} \\
&\quad= \frac{m!}{m_1!\,m_2!\cdots m_r!}.
\end{aligned}$$

Observe that there is no duplicate counting involved in the argument, even 
if some of the groups have the same size as in the tennis player example above. 
This is because we have given numbers to the boxes (groups). On the other 
hand, we are not arranging the numbered groups in order (as the words 
“ordered groups” employed by some authors would seem to imply). To clarify 
this essentially linguistic confusion let us consider another simple example. 


Example 5. Six mountain climbers decide to divide into three groups for 
the final assault on the peak. The groups will be of size 1, 2, 3 respectively 
and all manners of deployment are considered. What is the total number of 
possible grouping and deploying? 

The number of ways of splitting in $G_1, G_2, G_3$, where the subscript denotes 
the size of group, is given by (3.3.2): 

$$\frac{6!}{1!\,2!\,3!} = 60.$$

Having formed these three groups, there remains the decision which group 
leads, which in the middle, and which backs up. This is solved by Case IIa: 
3! = 6. Now each grouping can be combined freely with each deploying, 
hence the Fundamental Rule gives the final answer 60 · 6 = 360. 

What happens when some of the groups have the same size? Think about 
the tennis players again. 
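
For readers who like to experiment, here is a small sketch contrasting numbered and unnumbered groups on the four players of Example 4, followed by the arithmetic of Example 5. The variable names are illustrative assumptions; the enumeration itself uses only `itertools.combinations`.

```python
from itertools import combinations
from math import factorial

people = (1, 2, 3, 4)

# Numbered pairs (IIIb): choose who goes into pair no. 1; the rest form pair no. 2.
numbered = [(first, tuple(p for p in people if p not in first))
            for first in combinations(people, 2)]

# Unnumbered pairs: (12)(34) and (34)(12) are the same split.
unnumbered = {frozenset(map(frozenset, split)) for split in numbered}

print(len(numbered), len(unnumbered))     # 6 3

# Example 5: groups of sizes 1, 2, 3 are told apart by size, so numbering
# them adds nothing; deploying the three groups contributes a factor 3!.
groupings = factorial(6) // (factorial(1) * factorial(2) * factorial(3))
print(groupings, groupings * factorial(3))   # 60 360
```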

Returning to (3.3.2), this is the same multinomial coefficient obtained as 
solution to the permutation problem IIIa. Here it appears as a combination 
type of problem since we have used sampling Case III repeatedly in its derivation. 
Thus it is futile to try to pin a label on a problem as permutation or 
combination. The majority of combinatorial problems involves a mixture of 
various ways of counting discussed above. We will now illustrate this with 
several worked problems. 


In the remainder of this section we will establish some useful formulas 
connecting binomial coefficients. First of all, let us lay down the convention: 


(3.3.3)    $$\binom{m}{n} = 0 \quad \text{if } m < n, \text{ or if } n < 0.$$


Next we show that 


(3.3.4)    $$\binom{m}{n} = \binom{m-1}{n-1} + \binom{m-1}{n}, \qquad 0 \le n \le m.$$


Since we have the explicit evaluation of $\binom{m}{n}$ from (3.2.3), this can of course 
be verified at once. But here is a combinatorial argument without computation. 
Recall that $\binom{m}{n}$ is the number of different ways of choosing n objects 
out of m objects, which may be thought of as being done at one stroke. Now 
think of one of the objects as “special.” This special one may or may not be 
included in the choice. If it is included, then the number of ways of choosing 
n − 1 more objects out of the other m − 1 objects is equal to $\binom{m-1}{n-1}$. 
If it is not included, then the number of ways of choosing all n objects from 
the other m − 1 objects is equal to $\binom{m-1}{n}$. The sum of the two alternatives 
must then give the total number of choices, and this is what (3.3.4) says. 
Isn’t this neat? 
As a consequence of (3.3.4), we can obtain $\binom{m}{n}$, $0 \le n \le m$, step by step 
as m increases, as follows: 


                                1
                              1   1
                            1   2   1
                          1   3   3   1
(3.3.5)                 1   4   6   4   1
                      1   5  10  10   5   1
                    1   6  15  20  15   6   1
                  1   7  21  35  35  21   7   1


For example, each number in the last row shown above is obtained by adding 
its two neighbors in the preceding row, where a vacancy may be regarded 
as zero: 

        1 = 0 + 1,  7 = 1 + 6,  21 = 6 + 15,  35 = 15 + 20,  35 = 20 + 15, 
        21 = 15 + 6,  7 = 6 + 1,  1 = 1 + 0. 

Thus $\binom{7}{n} = \binom{6}{n-1} + \binom{6}{n}$ for $0 \le n \le 7$. The array in (3.3.5) is called 
Pascal’s triangle, though apparently he was not the first one to have used it. 
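
The step-by-step construction translates directly into a few lines of Python. This is a sketch rather than a canonical implementation; the name `pascal_rows` is an assumption, and the padding with zeros implements the remark that a vacancy counts as zero.

```python
from math import comb

def pascal_rows(last_row: int):
    """Rows 0..last_row of Pascal's triangle, built only by additions as in (3.3.4)."""
    row = [1]
    rows = [row]
    for _ in range(last_row):
        padded = [0] + row + [0]          # a vacancy counts as zero
        row = [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]
        rows.append(row)
    return rows

rows = pascal_rows(7)
print(rows[7])   # [1, 7, 21, 35, 35, 21, 7, 1]
assert all(rows[m][n] == comb(m, n) for m in range(8) for n in range(m + 1))
```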


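* Observe that we can split the last term $\binom{m-1}{n}$ in (3.3.4) as we split the 
first term $\binom{m}{n}$ by the same formula applied to m − 1. Thus we obtain, 
successively: 

$$\begin{aligned}
\binom{m}{n} &= \binom{m-1}{n-1} + \binom{m-1}{n} \\
&= \binom{m-1}{n-1} + \binom{m-2}{n-1} + \binom{m-2}{n} \\
&= \binom{m-1}{n-1} + \binom{m-2}{n-1} + \binom{m-3}{n-1} + \binom{m-3}{n} = \cdots
\end{aligned}$$

The final result is 

(3.3.6)
$$\begin{aligned}
\binom{m}{n} &= \binom{m-1}{n-1} + \binom{m-2}{n-1} + \cdots + \binom{n}{n-1} + \binom{n}{n} \\
&= \sum_{k=n-1}^{m-1} \binom{k}{n-1} = \sum_{k \le m-1} \binom{k}{n-1};
\end{aligned}$$

since the last term in the sum is $\binom{n}{n} = \binom{n-1}{n-1}$, and for k < n − 1 the 
terms are zero by our convention (3.3.3). 


Example. $35 = \binom{7}{4} = \binom{6}{3} + \binom{5}{3} + \binom{4}{3} + \binom{3}{3} = 20 + 10 + 4 + 1.$ 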


Look at Pascal’s triangle to see where these numbers are located. 

As an application, we can now give another solution to the counting 
problem for sampling under (IV) in §2. By the second formulation (IV’) 
above, this is the number of ways of putting n indistinguishable [unnumbered] 
tokens into m labeled boxes without restriction. We know from (3.2.6) that 
it is equal to $\binom{m+n-1}{m-1}$, but the argument leading to this answer is pretty 
tricky. Suppose we were not smart enough to have figured it out that way, 
but have surmised the result by experimenting with small values of m and n. 
We can still establish the formula in general as follows. [Actually, that tricky 
argument was probably invented as an after-thought after the result had been 
surmised.] 

* The rest of the section may be omitted. 

We proceed by mathematical induction on the value of m. For m = 1 
clearly there is just one way of dumping all the tokens, no matter how 
many, into the one box, which checks with the formula since 
$\binom{1+n-1}{1-1} = \binom{n}{0} = 1$. Now suppose that the formula holds true for any number of tokens 
when the number of boxes is equal to m − 1. Introduce a new box; we may 
put any number of tokens into it. If we put j tokens into the new one, then 
we must put the remaining n − j tokens into the other m − 1 boxes. According 
to the induction hypothesis, there are $\binom{m-2+n-j}{m-2}$ different ways 
of doing this. Summing over all possible values of j, we have 

$$\sum_{j=0}^{n} \binom{m-2+n-j}{m-2} = \sum_{k=m-2}^{m+n-2} \binom{k}{m-2},$$

where we have changed the index of summation by setting m − 2 + n − j 
= k. The second sum above is equal to $\binom{m+n-1}{m-1}$ by (3.3.6), and the 
induction is complete. 
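
The induction can be mirrored by a recursive count, which gives a quick numerical confirmation for small m and n. The function name `allocations` and the tested ranges are assumptions of this sketch.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def allocations(m: int, n: int) -> int:
    """Ways to put n unnumbered tokens into m labeled boxes, computed by the
    recursion of the induction step: j tokens go into the new box."""
    if m == 1:
        return 1                      # everything goes into the single box
    return sum(allocations(m - 1, n - j) for j in range(n + 1))

for m in range(1, 7):
    for n in range(0, 7):
        assert allocations(m, n) == comb(m + n - 1, m - 1)

# (3.3.6) itself, e.g. 35 = C(6,3) + C(5,3) + C(4,3) + C(3,3)
assert comb(7, 4) == sum(comb(k, 3) for k in range(3, 7))
print(allocations(6, 6))   # 462
```

A finite check of this kind is of course no substitute for the proof; it only guards against slips in the indices.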
Next, let us show that 


(3.3.7)    $$\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = \sum_{k=0}^{n} \binom{n}{k} = 2^n;$$


that is, the sum of the nth row in Pascal’s triangle is equal to $2^n$ [the first 
row shown in (3.3.5) is the 0th]. If you know Newton’s Binomial Theorem 
this can be shown from 


(3.3.8)    $$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}$$


by substituting a = b = 1. But here is a combinatorial proof. The terms on 
the left side of (3.3.7) represent the various numbers of ways of choosing 
0, 1, 2, ..., n objects out of n objects. Hence the sum is the total number of 
ways of choosing any subset [the empty set and the entire set both included] 
from a set of size n. Now in such a choice each object may or may not be 
included, and the inclusion or exclusion of each object may be freely combined 
with that of any other. Hence the Fundamental Rule yields the total 
number of choices as 


$$\underbrace{2 \times 2 \times \cdots \times 2}_{n \text{ times}} = 2^n.$$


This is the number given on the right side of (3.3.7). It is the total number of 
distinct subsets of a set of size n. 


Example. For n = 2, all choices from (a, b) are: 

        ∅, {a}, {b}, {ab}. 


For n = 3, all choices from (a, b, c) are: 


        ∅, {a}, {b}, {c}, {ab}, {ac}, {bc}, {abc}. 
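
The include-or-exclude argument is itself an algorithm for listing subsets. In the following sketch (the name `subsets` is an assumption) each object receives an independent True/False decision, and the number of subsets produced is compared with 2 to the power n.

```python
from itertools import product

def subsets(items):
    """All subsets, built by an include/exclude decision for each object."""
    return [
        {x for x, keep in zip(items, choice) if keep}
        for choice in product((False, True), repeat=len(items))
    ]

for items in ("ab", "abc"):
    found = subsets(items)
    assert len(found) == 2 ** len(items)
    print(len(found), sorted("".join(sorted(s)) for s in found))
```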


Finally, let $k \le m$ be two positive integers. We show 


(3.3.9)    $$\binom{m}{n} = \sum_{j=0}^{k} \binom{k}{j}\binom{m-k}{n-j}.$$


Observe how the indices on top [at bottom] on the right side add up to the 
index on top [at bottom] on the left side; it is not necessary to indicate the 
precise range of j in the summation; we may let j range over all integers 
because the superfluous terms will automatically be zero by our convention 
(3.3.3). 

To see the truth of (3.3.9), we think of the m objects as being separated 
into two piles, one containing k objects and the other m − k. To choose n 
objects from the entire set, we may choose j objects from the first pile and 
n − j objects from the second pile, and combine them. By the Fundamental 
Rule, for each fixed value j the number of such combinations is equal to 
$\binom{k}{j}\binom{m-k}{n-j}$. So if we allow j to take all possible values and add up the 
results we obtain the total number of choices which is equal to $\binom{m}{n}$. You 
need not worry about “impossible” values for j when j > n or n − j > m − k, 
because the corresponding term will be zero by our convention. 


Example. 

$$\binom{7}{3} = \binom{3}{0}\binom{4}{3} + \binom{3}{1}\binom{4}{2} + \binom{3}{2}\binom{4}{1} + \binom{3}{3}\binom{4}{0},$$

$$\binom{7}{5} = \binom{3}{1}\binom{4}{4} + \binom{3}{2}\binom{4}{3} + \binom{3}{3}\binom{4}{2}.$$

In particular, if k = 1 in (3.3.9), we are back to (3.3.4). In this case our 
argument also reduces to the one used there. 

An algebraic derivation of (3.3.9), together with its extension to the case 
where the upper indices are no longer positive integers, will be given in 
Chapter 6. 
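
The identity (3.3.9) and the role of the zero convention (3.3.3) can be spot-checked numerically. The helper names `binom` and `split_sum` below are assumptions introduced for this sketch; `binom` wraps `math.comb` so that negative lower indices give zero instead of raising an error.

```python
from math import comb

def binom(m: int, n: int) -> int:
    """Binomial coefficient with the convention (3.3.3): zero if n < 0 or n > m."""
    return comb(m, n) if 0 <= n <= m else 0

def split_sum(m: int, n: int, k: int) -> int:
    """Right side of (3.3.9): choose j from a pile of k and n - j from the other m - k."""
    return sum(binom(k, j) * binom(m - k, n - j) for j in range(k + 1))

for m in range(1, 10):
    for k in range(1, m + 1):
        for n in range(0, m + 1):
            assert split_sum(m, n, k) == binom(m, n)

print([binom(3, j) * binom(4, 3 - j) for j in range(4)])   # [4, 18, 12, 1]
```

The printed terms add up to 35, the value of the binomial coefficient with m = 7 and n = 3.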
3.4. How to solve it. 


This section may be entitled “How to count.” Many students find these 
problems hard, partly because they have been inured in other elementary 
mathematics courses to the cook book variety of problems such as: “Solve 
$x^2 - 5x + 10 = 0$,” “differentiate $xe^{-x}$” (maybe twice), etc. One can do such 
problems by memorizing certain rules without any independent thought. Of 
course we have this kind of problem in “permutation and combination” too, 
and you will find some of these among the exercises. For instance, there is a 
famous formula to do the “round table” problem: “In how many different 
ways can 8 people be seated at a round table?” If you learned it you could 
solve the problem without knowing what the word “different” means. But a 
little variation may get you into deep trouble. The truth is, and that’s also a 
truism: there is no substitute for true understanding. However, it is not easy 
to understand the principles without concrete applications, so the handful of 
examples below are selected to be the “test cases.” More are given in the 
Exercises and you should have a lot of practice if you want to become an 
expert. Before we discuss the examples in detail, a few general tips will be 
offered to help you to do your own thing. They are necessarily very broad 
and rather slippery, but they may be of some help sometimes. 


(a) If you don’t see the problem well, try some particular (but not too par- 
ticular) case with small numbers so you can see better. This will fix in 
your mind what is to be counted, and help you especially in spotting 
duplicates and omissions. 

(b) Break up the problem into pieces provided that they are simpler, cleaner, 
and easier to concentrate on. This can be done sometimes by fixing one 
of the “variables,” and the number of similar pieces may be counted as a 
subproblem. 

(c) Don’t try to argue step by step if you can see complications rising rapidly. 
Of all the negative advice I gave my classes this was the least heeded but 
probably the most rewarding. Counting step by step may seem easy for 
the first couple of steps but do you see how to carry it through to the end? 

(d) Don’t be turned off if there is ambiguity in the statement of the problem. 
This is a semantical hang-up, not a mathematical one. Try all interpre- 
tations if necessary. This may not be the best strategy in a quiz but it’s a 
fine thing to do if you want to learn the stuff. In any case, don’t take 
advantage of the ambiguities of the English language or the oversight of 
your instructor to turn a reasonable problem into a trivial one. (See 
Exercise 13.) 


Problem 1. (Quality Control). Suppose that in a bushel of 550 apples there 
are 2 percent rotten ones. What is the probability that a “random sample” 
of 25 apples contains 2 rotten apples? 

This is the principle behind testing the quality of products by random 
checking. If the probability turns out to be too small on the basis of the 
claimed percentage, compared with that figured on some other suspected 
percentage, then the claim is in doubt. This problem can be done just as 
easily with arbitrary numbers so we will formulate it in the general case. 
Suppose there are k defective items in a lot of m products. What is the 
probability that a random sample of size n contains j defective items? The 
word “random” here signifies that all samples of size n, under Case III in §3.2, 
are considered equally likely. Hence the total number is $\binom{m}{n}$. How many 
of these contain exactly j defective items? To get such a sample we must 
choose any j out of the k defective items, and combine it freely with n − j 
out of the m − k non-defective items. The first choice can be made in $\binom{k}{j}$ 
ways, the second in $\binom{m-k}{n-j}$ ways, by sampling under Case III. By the 
Fundamental Rule, the total number of samples of size n containing j defective 
items is equal to the product, and consequently the desired probability 
is the ratio 


(3.4.1)    $$\binom{k}{j}\binom{m-k}{n-j} \bigg/ \binom{m}{n}.$$


In the case of the apples, we have m = 550, k = 11, n = 25, j = 2; so the 
probability is equal to 

$$\binom{11}{2}\binom{539}{23} \bigg/ \binom{550}{25}.$$


This number is not easy to compute, but we will learn how to get a good 
approximation later. Numerical tables are also available. 

If we sum the probabilities in (3.4.1) for all j, $0 \le j \le n$, the result ought 
to equal one since all possibilities are counted. We have therefore proved 
the formula 

$$\sum_{j=0}^{k} \binom{k}{j}\binom{m-k}{n-j} = \binom{m}{n}$$

by a probabilistic argument. This is confirmed by (3.3.9); indeed a little re- 
flection should convince you that the two arguments are really equivalent. 
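
With exact integer arithmetic the ratio (3.4.1) is no longer hard to evaluate. The sketch below uses `math.comb` and `fractions.Fraction` from the Python standard library; the function name `prob_defectives` is an assumption. It evaluates the apple case and confirms that the probabilities over all j add up to exactly one.

```python
from fractions import Fraction
from math import comb

def prob_defectives(m: int, k: int, n: int, j: int) -> Fraction:
    """(3.4.1): probability that a sample of size n from a lot of m items,
    k of them defective, contains exactly j defectives."""
    if j < 0 or j > k or n - j < 0 or n - j > m - k:
        return Fraction(0)            # impossible values contribute zero
    return Fraction(comb(k, j) * comb(m - k, n - j), comb(m, n))

m, k, n = 550, 11, 25                 # the bushel of apples
p2 = prob_defectives(m, k, n, 2)
print(float(p2))                      # decimal approximation of the exact fraction p2

# The probabilities over all j add up to exactly one.
assert sum(prob_defectives(m, k, n, j) for j in range(n + 1)) == 1
```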


Problem 2. If a deck of poker cards is thoroughly shuffled, what is the 
probability that the four aces are found in a row? 

There are 52 cards among which are 4 aces. A thorough shuffling signifies 
that all permutations of the cards are equally likely. For the whole deck, 
there are (52)! outcomes by Case IIa. In how many of these do the 4 aces 
stick together? Here we use tip (b) to break up the problem according to 
where the aces are found. Since they are supposed to appear in a row, we 




need only locate the first ace as we check the cards in the order they appear 
in the deck. This may be the top card, the next, and so on, until the 49th. 
Hence there are 49 positions for the 4 aces. After this has been fixed, the 
4 aces can still permute among themselves in 4! ways, and so can the 48 
non-aces. This may be regarded as a case of IIIa with r = 2, $m_1 = 4$, $m_2 = 48$. 
The Fundamental Rule carries the day and we get the answer 


$$\frac{49 \cdot 4!\,(48)!}{(52)!} = \frac{24}{52 \cdot 51 \cdot 50}.$$


This problem is a case where my tip (a) may be helpful. Try 4 cards with 
2 aces. The total number of permutations in which the aces stick together is 
only 3 · 2! 2! = 12 so you can list them all and look. 
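
The small case suggested by tip (a) is easy to list by machine. In this sketch the card labels "A1", "A2", "x", "y" and the function name `aces_together` are assumptions. The full deck is not enumerated; its answer is only checked through the formula.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def aces_together(deck, aces):
    """Count orderings of `deck` in which the cards in `aces` are adjacent."""
    count = 0
    for order in permutations(deck):
        positions = sorted(i for i, card in enumerate(order) if card in aces)
        if positions[-1] - positions[0] == len(aces) - 1:
            count += 1
    return count

small = aces_together(("A1", "A2", "x", "y"), {"A1", "A2"})
print(small)                              # 12 = 3 * 2! * 2!

full_deck = Fraction(49 * factorial(4) * factorial(48), factorial(52))
assert full_deck == Fraction(24, 52 * 51 * 50)
```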


Problem 3. Fifteen new students are to be evenly distributed among three 
classes. Suppose that there are three whiz-kids among the fifteen. What is the 
probability that each class gets one? one class gets them all? 

It should be clear that this is the partition problem discussed under Case 
IIIb, with m = 15, $m_1 = m_2 = m_3 = 5$. Hence the total number of outcomes 
is given by 


$$\frac{15!}{5!\,5!\,5!}.$$


To count the number of these assignments in which each class gets one whiz-kid 
we will first assign these three kids. This can be done in 3! ways by IIa. The 
other 12 students can be evenly distributed in the 3 classes by Case IIIb with 
m = 12, $m_1 = m_2 = m_3 = 4$. The Fundamental Rule applies and we get the 
desired probability 


$$3! \cdot \frac{12!}{4!\,4!\,4!} \bigg/ \frac{15!}{5!\,5!\,5!} = \frac{6 \cdot 5^3}{15 \cdot 14 \cdot 13}.$$


Next, if one class gets them all, then there are 3 possibilities according to 
which class it is, and the rest is similar. So we just replace the numerator 
above by $3 \cdot \frac{12!}{5!\,5!\,2!}$ and obtain 


$$3 \cdot \frac{12!}{5!\,5!\,2!} \bigg/ \frac{15!}{5!\,5!\,5!} = \frac{5 \cdot 4 \cdot 3^2}{15 \cdot 14 \cdot 13}.$$


By the way, we can now get the probability of the remaining possibility, 
namely that the number of whiz-kids in the three classes be two, one, zero 
respectively. 
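
The three probabilities of Problem 3 can be evaluated exactly with fractions. This is a sketch under the same counting as above; the helper name `multinomial` is an assumption, and the last probability is obtained by subtraction and covers the pattern two, one, zero in any order of the classes.

```python
from fractions import Fraction
from math import factorial

def multinomial(*sizes: int) -> int:
    total = factorial(sum(sizes))
    for s in sizes:
        total //= factorial(s)
    return total

outcomes = multinomial(5, 5, 5)                       # 15 students, 3 numbered classes
one_each = Fraction(factorial(3) * multinomial(4, 4, 4), outcomes)
all_in_one = Fraction(3 * multinomial(5, 5, 2), outcomes)
two_one_zero = 1 - one_each - all_in_one              # the remaining possibility

print(one_each, all_in_one, two_one_zero)             # 25/91 6/91 60/91
```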


Problem 4. Six dice are rolled. What is the probability of getting three pairs? 

One can ask at once “which three pairs?” This means a choice of 3 
numbers out of the 6 numbers from 1 to 6. The answer is given by sampling 
under Case III: $\binom{6}{3} = 20$. Now we can concentrate on one of these cases, 
say {2, 3, 5} and figure out the probability of getting “a pair of 2, a pair 
of 3, and a pair of 5.” This is surely more clear-cut, so my tip (b) should 
be used here. To count the number of ways 6 dice can show the pattern 
{2, 2, 3, 3, 5, 5}, one way is to consider this as putting six labeled tokens (the 
dice as distinguishable) into three boxes marked [2], [3], [5], with two going 
into each box. So the number is given by IIIb: 


$$\frac{6!}{2!\,2!\,2!} = 90.$$


Another way is to think of the dice as six distinguishable performers standing 
in line waiting for cues to do their routine acts, with two each doing acts 
nos. 2, 3, 5 respectively, but who does which is up to Boss Chance. This then 
becomes a permutation problem under IIIa and gives of course the same 
number above. Finally, we multiply this by the number of choices of the 3 
numbers to get 


$$\binom{6}{3} \frac{6!}{2!\,2!\,2!} = 20 \cdot 90.$$


You may regard this multiplication as another application of the ubiquitous 
Fundamental Rule but it really just means that 20 mutually exclusive categories 
are added together, each containing 90 cases. The desired probability 
is given by 


$$\frac{20 \cdot 90}{6^6} = \frac{25}{648}.$$
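
Since there are only 46656 outcomes under Case I, the count 20 · 90 can be confirmed by brute force. The sketch below uses `collections.Counter` to read off the pattern of each roll.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 6^6 equally likely outcomes for six distinguishable dice and
# keep those whose face counts are exactly three pairs.
three_pairs = sum(
    1
    for roll in product(range(1, 7), repeat=6)
    if sorted(Counter(roll).values()) == [2, 2, 2]
)
print(three_pairs, Fraction(three_pairs, 6 ** 6))     # 1800 25/648
```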


This problem is a case where my negative tip (c) may save you some wasted 
time as I have seen students trying an argument as follows. If we want to 
end up with three pairs, the first two dice can be anything; the third die must 
be one of the two if they are different, and a new one if they are the same. 
The probability of the first two being different is 5/6, in which case the third 
die has probability 2/6; on the other hand the probability of the first two 
being the same is 1/6, in which case the third has probability 5/6. Are you still 
with us? But what about the next step, and the next? 

However, this kind of sequential analysis, based on conditional proba- 
bilities, will be discussed in Chapter 5. It works very well sometimes, as in 
the next problem. 


Problem 5. (Birthdays) What is the probability that among n people there 
are at least two who have the same birthday? We are assuming that they 
“choose” their birthdays independently of one another, so that the result is 
as if they had drawn n balls marked from 1 to 365 (ignoring leap years) by 
sampling under Case I. All these outcomes are equally likely and the total 




number is $365^n$. Now we must count those cases in which some of the balls 
drawn bear the same number. This sounds complicated but it is easy to 
figure out the “opposite event,” namely when all n balls are different. This 
falls under Case II and the number is $(365)_n$. Hence the desired probability is 


$$p_n = 1 - \frac{(365)_n}{365^n}.$$


What comes as a surprise is the numerical fact that this probability exceeds 
1/2 as soon as $n \ge 23$; see table below.* What would you have guessed? 


         n     p_n
         5     .03
        10     .12
        15     .25
        20     .41
        21     .44
        22     .48
        23     .51
        24     .54
        25     .57
        30     .71
        35     .81
        40     .89
        45     .94
        50     .97
        55     .99
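
The table is easy to regenerate. In this sketch the function name `birthday_probability` and its `days` parameter are assumptions. The probability of all-different birthdays is accumulated as a running product of n factors, so no large factorials are formed.

```python
def birthday_probability(n: int, days: int = 365) -> float:
    """p_n = 1 - (days)_n / days**n, computed as a product of n factors."""
    all_different = 1.0
    for i in range(n):
        all_different *= (days - i) / days
    return 1.0 - all_different

for n in (5, 10, 15, 20, 21, 22, 23, 24, 25, 30, 35, 40, 45, 50, 55):
    print(f"{n:2d}  {birthday_probability(n):.2f}")

# Smallest n for which the probability exceeds 1/2.
print(next(n for n in range(1, 366) if birthday_probability(n) > 0.5))   # 23
```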


One can do this problem by a naive argument which turns out to be 
correct. To get the probability that n people have all different birthdays, we 
order them in some way and consider each one’s “choice,” as in the case of 
the six dice which show different faces (Example 2, §3.1). The first person can 
have any day of the year for his birthday, hence probability 1; the second can 
have any but one, hence probability 364/365; the third any but two, hence 
probability 363/365; and so on. Thus the final probability is 


$$\frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \quad (n \text{ factors}),$$


which is just another way of writing $(365)_n/365^n$. The intuitive idea of 
sequential conditional probabilities used here is equivalent to a splitting diagram 
described in Section 3.1, beginning with 365 cases, each of which splits 
into 364, then again into 363, etc. If one divides out by 365 at each stage one 
gets the product above. 

* Computation from n = 2 to n = 55 was done on a small calculator to five decimal places 
in a matter of minutes. 


Problem 6. (Matching). Four cards numbered 1 to 4 are laid face down on 
a table and a person claiming clairvoyance will name them by his extrasensory 
power. If he is a faker and just guesses at random, what is the probability 
that he gets at least one right? 

There is a neat solution to this famous problem by a formula to be 
established later in §6.2. But for a small number like 4, brute force will do 
and in the process we shall learn something new. Now the faker simply picks 
any one of the 4! permutations and these are considered equally likely. Using 
tip (c), we will count the number of cases in which there is exactly 1 match. 
This means the other three cards are mismatched, and so we must count the 
“no match” cases for three cards. This can be done by enumerating all the 
3! = 6 possible random guesses as tabulated below: 


real (abc) (abc) (abc) (abc) (abc) (abc) 


guess (abc) (acb) (bac) (bca) (cab) (cba) 


There are two cases of no-match: the 4th and 5th above. We obtain all
cases in which there is exactly one match in 4 cards by fixing that one match
and mismatching the three other cards. There are 4 choices for the card to be
matched, and after this is chosen, there are 2 ways to mismatch the other
three by the tabulation above. Hence by the modified Fundamental Rule
there are 4 · 2 = 8 cases of exactly one match in 4 cards. Next, fix two matches
and mismatch the other two. There is only one way to do the latter, hence
the number of cases of exactly two matches in 4 cards is equal to that of
choosing two cards (to be matched) out of 4, which is $\binom{4}{2} = 6$.


Finally, it is clear that if three cards match the remaining one must also, 
and there is just one way of matching them all. The results are tabulated as 
follows: 


Exact number
of matches    | Number of cases | Probability

     4        |        1        |    1/24
     3        |        0        |    0
     2        |        6        |    1/4
     1        |        8        |    1/3
     0        |        9        |    3/8
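The tabulation above can be checked by listing all 24 permutations and counting matches in each. The following is a minimal sketch; the function name match_counts is ours, and the cards are numbered from 0 merely for convenience.

```python
from collections import Counter
from itertools import permutations


def match_counts(n):
    """Tally the n! equally likely guesses by their exact number of matches."""
    tally = Counter()
    for guess in permutations(range(n)):
        # A match occurs where the guessed card equals the card actually there.
        matches = sum(1 for position, card in enumerate(guess) if position == card)
        tally[matches] += 1
    return tally


counts = match_counts(4)
for k in range(4, -1, -1):
    print(k, counts[k])
```

Calling match_counts(3) reproduces the two no-match cases found in the enumeration for three cards.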


The last row above, for the number of cases of no-match, is obtained by 
subtracting the sum of the other cases from the total number: 


24 − (1 + 6 + 8) = 9.

The probability of at least one match is 15/24 = 5/8; of at least two is 7/24. 

You might propose to do the counting without any reasoning by listing 
all 24 cases for 4 cards, as we did for 3 cards. That is a fine thing to do, not 
only for your satisfaction but also to check the various cases against our 
reasoning above. But our step leading from 3 cards to 4 cards is meant to 
be an illustration of the empirical inductive method, and can lead also from 
4 to 5, etc. In fact, that is the way the computing machines do things. They 
are really not very smart, and always do things step by step, but they are 
organized and tremendously fast. In our case a little neat algebra does it 
better, and we can establish the following general formula for the number of 
cases of at least one match for n cards:


$n!\left(1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1}\frac{1}{n!}\right).$


Subtracting this number from n! gives the number of no-match cases, which
has been called the derangement number for n; see §6.2. 
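For small n the general formula can be compared with brute-force counting. In this sketch we take the formula to be the standard alternating sum n!(1 − 1/2! + 1/3! − ⋯ + (−1)^(n−1)/n!), which is an assumption on our part where the printed display is unclear; the function names are ours as well.

```python
from itertools import permutations
from math import factorial


def at_least_one_match_formula(n):
    """n! * (1 - 1/2! + 1/3! - ... + (-1)**(n-1)/n!), kept in exact integers."""
    return sum((-1) ** (k - 1) * (factorial(n) // factorial(k)) for k in range(1, n + 1))


def at_least_one_match_brute(n):
    """Count the permutations of n cards having at least one card in its own place."""
    return sum(
        1
        for guess in permutations(range(n))
        if any(position == card for position, card in enumerate(guess))
    )


for n in range(1, 8):
    print(n, at_least_one_match_formula(n), at_least_one_match_brute(n))
```

The brute-force count grows like n! and is practical only for small n, which is exactly why a formula is wanted.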


Problem 7. In how many ways can n balls be put into n numbered boxes so 
that exactly one box is empty? This problem is instructive as it illustrates 
several points made above. First of all, it is ambiguous whether the balls are 
distinguishable or not. Using my tip (d), we will treat both hypotheses. 


Hypothesis 1. The balls are indistinguishable. Then it is clearly just a matter 
of picking the empty box and the one which must have two balls. This is a 
sampling problem under Case II and the answer is (n)_2 = n(n − 1). 

This easy solution would probably be acceptable granted the ambiguous 
wording, but we learn more if we try the harder way too. 


Hypothesis 2. The balls are distinguishable. Then after the choice of the two
boxes as under Hypothesis 1 (call it step 1), we still have the problem as to
which ball goes into which box. This is a problem of partition under Case
IIIb with m = n, m_1 = 2, m_2 = ⋯ = m_{n−1} = 1, the empty box being left
out of consideration. Hence the answer is


(3.4.2) $\frac{n!}{2!\,1!\cdots 1!} = \frac{n!}{2}.$


You don’t have to know about that formula, since you can argue directly as
follows. The question is how to put n numbered balls into n − 1 numbered
boxes with 2 balls going into a certain box (already chosen by step 1) and
1 ball each into all the rest. There are $\binom{n}{2}$ ways of choosing the two
balls to go into that particular box; after that the remaining n − 2 balls can go
into the other n − 2 boxes in (n − 2)! ways. The product of these two numbers
is the same as (3.4.2). Finally the total number of ways under Hypothesis
2 is given by


(3.4.3) $n(n-1) \cdot \frac{n!}{2}.$


We have argued in two steps above. One may be tempted to argue in
three steps as follows. First choose the empty box, then choose n − 1 balls
and put them one each into the other n − 1 boxes, finally throw the last ball
into any one of the latter. The number of possible choices for each step is
equal respectively to n, (n)_{n−1} (by sampling under II) and n − 1. If they are
multiplied together the result is


(3.4.4) n · n! (n − 1)


which is twice as large as (3.4.3). Which is correct? 

This is the kind of situation my tip (a) is meant to help. Take n = 3 and
suppose the empty box has been chosen, so the problem is to put balls 1, 2, 3
into boxes A and B. For the purpose of the illustration let A be square and
B round. Choose two balls to put into these two boxes; there are six cases
as shown (square brackets stand for the square box A, parentheses for the
round box B):


[1](2)   [1](3)   [2](1)   [2](3)   [3](1)   [3](2)


Now throw the last ball into one of the two boxes, so that each case above
splits into two according as which box gets it:


[1 3](2)   [1 2](3)   [2 3](1)   [2 1](3)   [3 2](1)   [3 1](2)
[1](2 3)   [1](3 2)   [2](1 3)   [2](3 1)   [3](1 2)   [3](2 1)


You see what the trouble is. Each final case is counted twice because the box 
which gets two balls can get them in two orders! The trouble is the same in 
the general case and so we must divide the number in formula (3.4.4) by 2 
to eliminate double-counting, which makes it come out as in formula (3.4.3). 
All is harmony. 
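Tip (a) can also be carried out by machine for several small values of n at once. The sketch below, with function names of our own choosing, lists every way of putting n numbered balls into n numbered boxes, keeps those leaving exactly one box empty, and compares the count with the two candidate formulas.

```python
from itertools import product
from math import factorial


def count_exactly_one_empty(n):
    """Brute force over all n**n placements; assignment[ball] is the box it goes into."""
    total = 0
    for assignment in product(range(n), repeat=n):
        # Exactly one empty box means exactly n - 1 distinct boxes are used.
        if len(set(assignment)) == n - 1:
            total += 1
    return total


def two_step_count(n):
    """The two-step answer n(n - 1) * n!/2."""
    return n * (n - 1) * factorial(n) // 2


def three_step_count(n):
    """The tempting three-step answer n * n! * (n - 1)."""
    return n * factorial(n) * (n - 1)


for n in range(2, 6):
    print(n, count_exactly_one_empty(n), two_step_count(n), three_step_count(n))
```

In each row the brute-force count agrees with the two-step answer, and the three-step answer is exactly twice as large.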


Exercises 


(When probabilities are involved in the problems below, the equally likely 
cases should be “obvious” from the context. In case you demur, follow my 
tip (d).) 

1. A girl decides to choose either a shirt or a tie for a birthday present.
    There are 3 shirts and 2 ties to choose from. How many choices does
    she have if she will get only one of them? If she may get both a shirt
    and a tie?

2. There are 3 kinds of shirts on sale. (a) If two men buy one shirt each
    how many possibilities are there? (b) If two shirts are sold, how many
    possibilities are there?

3. As in No. 2 make up a good question with 3 shirts and 2 men, to which
    the answer is 2^3, or $\binom{3+2-1}{3}$.

4. If on the menu shown in §3.1 there are 3 kinds of ice cream and 2 kinds
    of pie to choose from, how many different dinners are there? If we take
    into account that the customer may skip the vegetable or the dessert or
    both, how many different dinners are there?

5. How many different initials can be formed with 2 or 3 letters of the
    alphabet? How large must the alphabet be in order that one million
    people can be identified by 3-letter initials?

6. How many integers are there between one million and ten million, in
    whose decimal form no two consecutive digits are the same?

7. In a “true or false” test there are 12 questions. If a student decides to
    check six of each at random, in how many ways can he do it?

8. In how many ways can 4 boys and 4 girls pair off? In how many ways
    can they stand in a row in alternating sex?

9. In how many ways can a committee of three be chosen from 20 people?
    In how many ways can a president, a secretary and a treasurer be
    chosen?

10. If you have 2 dollars, 2 quarters and 3 nickels, how many different
    sums can you pay without making change? Change the quarters into
    dimes and answer again.

11. Two screws are missing from a machine which has screws of three
    different sizes. If three screws of different sizes are sent over what is the
    probability that they are what’s needed?

12. There are two locks on the door and the keys are among the six different
    ones you carry in your pocket. In a hurry you dropped one somewhere.
    What is the probability that you can still open the door? What is the
    probability that the first two keys you try will open the door?

13. A die is rolled three times. What is the probability that you get a larger
    number each time? (I gave this simple problem in a test but used
    inadvertently the words “. . . that the numbers you obtain increase
    steadily.” Think of a possible misinterpretation of the words!)

14.* Three dice are rolled twice. What is the probability that they show the
    same numbers (a) if the dice are distinguishable, (b) if they are not?
    [Hint: divide into cases according to the pattern of the first throw: a
    pair, a triple or all different; then match the second throw accordingly.]

15. You walk into a party without knowing anyone there. There are 6
    women and 4 men and you know there are 4 married couples. In how
    many ways can you guess who the couples are? What if you know there
    are exactly 3 couples?

16. Four shoes are taken at random from five different pairs. What is the
    probability that there is at least one pair among them?


17. A California driver decides that he must switch lanes every minute to
    get ahead. If he is on a 4-lane divided highway and does this at random,
    what is the probability that he is back on his original lane after 4 minutes
    (assuming no collision)? [Hint: the answer depends on whether he starts
    on an outside or inside lane.]

18. In sampling under Case I or II of §3.2, what is the probability that in
    n drawings a particular ball is never drawn? Assume n < m.

19. You are told that of the four cards face down on the table, two are
    red and two are black. If you guess all four at random, what is the
    probability that you get 0, 2, 4 right?

20. An airport shuttle bus makes 4 scheduled stops for 15 passengers. What
    is the probability that all of them get off at the same stop? What is the
    probability that someone (at least one person) gets off at each stop?

21. Ten books are made into 2 piles. In how many ways can this be done
    if books as well as piles may or may not be distinguishable? Treat all
    4 hypotheses and require that neither pile be empty.

22. Ten different books are to be given to Daniel, Phillip, Paul and John
    who will get in the order given 3, 3, 2, 2 books respectively. In how many
    ways can this be done? Since Paul and John screamed “no fair” it is
    decided that they draw lots to determine which two get 3 and which two
    get 2. How many ways are there now for a distribution? Finally, Marilda
    and Corinna also want a chance and so it is decided that the six
    kids should draw lots to determine which two get 3, which two get 2
    and which two get none. Now how many ways are there? (There is real
    semantical difficulty in formulating these distinct problems in general.
    It is better to be verbose than concise in such a situation. Try putting
    tokens into boxes.)

23. In a draft lottery containing the 366 days of the year (including February
    29), what is the probability that the first 180 days drawn (without
    replacement of course) are evenly distributed among the 12 months? What
    is the probability that the first 30 days drawn contain none from August
    or September? [Hint: first choose 15 days from each month.]

24. At a certain resort the travel bureau finds that tourists occupy the 20
    hotels there as if they were so many look-alike tokens (fares) placed in
    numbered boxes. If this theory is correct, what is the probability that
    when the first batch of 30 tourists arrive, no hotel is left vacant? [This
    model is called the Bose-Einstein statistic in physics. If the tourists are
    treated as distinct persons it is the older Boltzmann-Maxwell statistic;
    see [Feller 1; §II.5].]

25. 100 trout are caught in a little lake and returned after they are tagged.
    Later another 100 are caught and found to contain 7 tagged ones. What
    is the probability of this if the lake contains n trout? (What is your best
    guess as to the true value of n? The latter is the kind of question asked
    in Statistics.)

26. Program a one-to-one correspondence between the various possible
    cases under the two counting methods in IIa and IIIb by taking m = 4,
    m_1 = 2, m_2 = m_3 = 1.

27. (For poker players only.) In a poker hand assume all “hands” are
    equally likely as under Sampling Case III. Compute the probability of
    (a) flush (b) straight (c) straight flush (d) four of a kind (e) full house.

28.* Show that

    $\binom{2n}{n} = \sum_{k=0}^{n} \binom{n}{k}^2.$

    [Hint: apply (3.3.9).]


29.* The number of different ways in which a positive integer n can be
    written as the sum of positive integers not exceeding n is called (in
    number theory) the “partition number” of n. For example,


    6 = 6                        sextuple
      = 5 + 1                    quintuple
      = 4 + 2                    quadruple and pair
      = 4 + 1 + 1                quadruple
      = 3 + 3                    two triples
      = 3 + 2 + 1                triple and pair
      = 3 + 1 + 1 + 1            triple
      = 2 + 2 + 2                three pairs
      = 2 + 2 + 1 + 1            two pairs
      = 2 + 1 + 1 + 1 + 1        one pair
      = 1 + 1 + 1 + 1 + 1 + 1    all different (“no same”)


    Thus the partition number of 6 is equal to 11; compare this with the
    numbers 46656 and 462 given in Examples 2 and 3. This may be called
    the total number of distinguishable “coincidence patterns” when six dice
    are thrown. We have indicated simpler (but vaguer) names for these
    patterns in the listing above. Compute their respective probabilities. (It
    came to me as a surprise that “two pairs” is more probable than “one
    pair” and has a probability exceeding 1/3. My suspicion of an error in
    the computation was allayed only after I had rolled six dice one hundred
    times. It was an old custom in China to play this game over the New
    Year holidays, and so far as I can remember, “two pairs” were given a
    higher rank (prize) than “one pair.” This is unfair according to their
    probabilities. Subsequently to my own experiment I found that Feller
    had listed analogous probabilities for 7 dice. His choice of this “random
    number” 7 in disregard or ignorance of a time-honored game had
    probably resulted in my overlooking his tabulation.)

30.* (Banach’s match problem.) The Polish mathematician Banach kept two
    match boxes, one in each pocket. Each box contains n matches.
    Whenever he wanted a match he reached out at random into one of his
    pockets. When he found that the box he picked was empty, what is the
    distribution of the number of matches left in the other box? [Hint:
    divide into two cases according as the left or right box is empty, but be
    careful about the case when both are empty.]


Chapter 4 


Random Variables 


4.1 What is a random variable? 


We have seen that the points of a sample space may be very concrete 
objects such as apples, molecules and people. As such they possess various 
qualities some of which may be measurable. An apple has its weight and 
volume; its juicy content can be scientifically measured, even its taste may 
be graded by expert tasters. A molecule has mass and velocity, from which 
we can compute its momentum and kinetic energy by formulas from physics. 
For a human being there are physiological characteristics such as age, height 
and weight. But there are many other numerical data attached to him (or her) 
like I.Q., number of years of schooling, number of brothers and sisters, annual 
income earned and taxes paid, and so on. We will examine some of these 
illustrations and then set up a mathematical description in general terms. 


Example 1. Let Ω be a human population containing n individuals. These 
may be labeled as 


(4.1.1) Ω = {ω_1, ω_2, ..., ω_n}. 


If we are interested in their age distribution, let A(ω) denote the age of ω.
Thus to each ω is associated a number A(ω) in some unit, such as “year.”
So the mapping


ω → A(ω)


is a function with Ω as its domain of definition. The range is a set of integers
but can be made more precise by fractions or decimals or spelled out as e.g.,
“18 years, 5 months and 1 day.” There is no harm if we take all positive
integers or all positive real numbers as the range, although only a very small
portion of it will be needed. Accordingly, we say A is an integer-valued or
real-valued function. Similarly, we may denote the height, weight and income
by the functions:


ω → H(ω),
ω → W(ω),
ω → I(ω).


In the last case I may take negative values! Now for some medical purposes,
a linear combination of height and weight may be an appropriate measure:


ω → λH(ω) + μW(ω)


where λ and μ are two numbers. This then is also a function of ω. Similarly,
if ω is a “head of family” alias breadwinner, the census bureau may want to
compute the function:


ω → I(ω)/N(ω)


where N(ω) is the number of persons in his family, namely the number of
mouths to be fed. The ratio above represents then the “income per capita”
for the family.

Let us introduce some convenient symbolism to denote various sets of
sample points derived from random variables. For example, the set of ω in Ω
for which the age is between 20 and 40 will be denoted by


{ω | 20 ≤ A(ω) ≤ 40}

or more briefly when there is no danger of misunderstanding by

{20 ≤ A ≤ 40}.


The set of ω for which the height is between 65 and 75 (in inches) and the
weight is between 120 and 180 (in pounds) can be denoted in several ways
as follows:


{ω | 65 ≤ H(ω) ≤ 75} ∩ {ω | 120 ≤ W(ω) ≤ 180}
= {ω | 65 ≤ H(ω) ≤ 75; 120 ≤ W(ω) ≤ 180}
= {65 ≤ H ≤ 75; 120 ≤ W ≤ 180}.
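The symbolism may be tried out on a toy population. Everything in the sketch below is hypothetical: the four names, their data, and our reading of “between” as including both end points. Each sample point is a name, and A, H, W are ordinary functions of it.

```python
# Hypothetical population; the figures are invented for illustration only.
population = {
    "ann": {"age": 34, "height": 66, "weight": 130},
    "bob": {"age": 52, "height": 71, "weight": 190},
    "cara": {"age": 19, "height": 64, "weight": 115},
    "dev": {"age": 40, "height": 74, "weight": 175},
}
omega_space = set(population)


def A(w):
    """Age of the sample point w, in years."""
    return population[w]["age"]


def H(w):
    """Height of w, in inches."""
    return population[w]["height"]


def W(w):
    """Weight of w, in pounds."""
    return population[w]["weight"]


# The set written {20 <= A <= 40} in the text (end points included by assumption).
age_20_40 = {w for w in omega_space if 20 <= A(w) <= 40}

# Two of the equivalent descriptions of the height-and-weight set.
as_intersection = {w for w in omega_space if 65 <= H(w) <= 75} & {
    w for w in omega_space if 120 <= W(w) <= 180
}
as_joint_condition = {w for w in omega_space if 65 <= H(w) <= 75 and 120 <= W(w) <= 180}

print(sorted(age_20_40), sorted(as_intersection), sorted(as_joint_condition))
```

The intersection and the joint condition pick out the same subset, which is all that the different ways of writing the set assert.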

Example 2. Let Ω be gaseous molecules in a given container. We can still
represent Ω as in (4.1.1) even though n is now a very large number such as 10^25.


Let m = mass, v = velocity, M = momentum, E = kinetic energy. Then we
have the corresponding functions:


ω → m(ω),
ω → v(ω),
ω → M(ω) = m(ω)v(ω),
ω → E(ω) = (1/2) m(ω)v(ω)^2.


In experiments with gases actual measurements may be made of m and v,
but the quantities of interest may be M or E which can be derived from the
formulas. Similarly, if θ is the angle of the velocity relative to the x-axis,
ω → θ(ω) is a function of ω, and


ω → cos θ(ω)


[Figure 17, showing cos θ(ω)]


may be regarded as the composition of the function “cos” with the function
“θ.” The set of all molecules which are moving toward the right is represented
by


{ω | cos θ(ω) > 0}.


Example 3. Let Ω be the outcome space of throwing a die twice. Then it
consists of 6^2 = 36 points listed below:


(1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)  (1, 6)
(2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)  (2, 6)
(3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)  (3, 6)
(4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)  (4, 6)
(5, 1)  (5, 2)  (5, 3)  (5, 4)  (5, 5)  (5, 6)
(6, 1)  (6, 2)  (6, 3)  (6, 4)  (6, 5)  (6, 6)


Thus each ω is represented by an ordered pair or a two-dimensional vector:

ω_k = (x_k, y_k),   k = 1, 2, ..., 36,

where x_k and y_k take values from 1 to 6. The first coordinate x represents the
outcome of the first throw, the second coordinate y the outcome of the second
throw. These two coordinates are determined by the point ω, hence they are
functions of ω:


(4.1.2) ω → x(ω), ω → y(ω).


On the other hand, each ω is completely determined by its two coordinates,
so much so that we may say that ω is the pair of them:


ω = (x(ω), y(ω)).


This turnabout is an important concept to grasp. For example, let the die be
thrown n times and the results of the successive throws be denoted by


x_1(ω), x_2(ω), ..., x_n(ω);


then not only is each x_k(ω), k = 1, 2, ..., n, a function of ω which may be
called its kth coordinate, but the totality of these n functions in turn
determines ω, and therefore ω is nothing more or less than the n-dimensional
vector


(4.1.3) ω = (x_1(ω), x_2(ω), ..., x_n(ω)).


In general, each x_k(ω) represents a certain numerical characteristic of the
sample ω, and although ω may possess many, many characteristics, in most
questions only a certain set of them is taken into account. Then a
representation like (4.1.3) is appropriate. For example, in a traditional beauty
contest, only three bodily measurements given in inches are considered, such as
(36, 29, 38). In such a contest (no “song and dance”) each contestant is
reduced to such an ordered triple:


contestant = (x, y, z).


Another case of this kind is when a student takes a number of tests, say 4,
which are graded on the usual percentage basis. Let the student be ω, his
scores on the 4 tests be x_1(ω), x_2(ω), x_3(ω), x_4(ω). For the grader (or the
computing machine if all the tests can be machine processed), each ω is just the
4 numbers: (x_1(ω), ..., x_4(ω)). Two students who have the same scores are
not distinguished. Suppose the criterion for success is that the total should
exceed 200; then the set of successful candidates is represented by


{ω | x_1(ω) + x_2(ω) + x_3(ω) + x_4(ω) > 200}.


A variation is obtained if different weights λ_1, λ_2, λ_3, λ_4 are assigned
to the 4 tests; then the criterion will depend on the linear combination
λ_1 x_1(ω) + ⋯ + λ_4 x_4(ω). Another possible criterion for passing the tests is
given by


{ω | min (x_1(ω), x_2(ω), x_3(ω), x_4(ω)) > 35}.


What does this mean in plain English? 
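The grader’s point of view is easily imitated. In the sketch below the four students, their scores and the weights are all invented for illustration; each student ω is reduced to a 4-tuple of scores, and each criterion becomes a subset of the class.

```python
# Hypothetical class: each sample point is reduced to its four test scores.
scores = {
    "s1": (80, 45, 30, 60),
    "s2": (50, 50, 50, 50),
    "s3": (90, 90, 20, 10),
    "s4": (36, 40, 70, 65),
}


def x(k, w):
    """The k-th coordinate function (k = 1, 2, 3, 4): score of student w on test k."""
    return scores[w][k - 1]


def weighted_total(w, weights):
    """The linear combination of the four scores with the given (assumed) weights."""
    return sum(lam * x(k, w) for k, lam in enumerate(weights, start=1))


# The set of students whose total exceeds 200.
pass_by_total = {w for w in scores if sum(x(k, w) for k in range(1, 5)) > 200}

# The set of students whose minimum score exceeds 35.
pass_by_minimum = {w for w in scores if min(x(k, w) for k in range(1, 5)) > 35}

print(sorted(pass_by_total), sorted(pass_by_minimum))
print(weighted_total("s2", (0.1, 0.2, 0.3, 0.4)))
```

Comparing the two printed sets for this invented class shows that the criteria are genuinely different: neither set contains the other.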
4.2. How do random variables come about?


We can now give a general formulation for numerical characteristics of sample
points. Assume first that Ω is a countable space. This assumption makes an
essential simplification which will become apparent; other spaces will be
discussed later.


Definition of Random Variable. A numerically valued function X of ω with
domain Ω:


(4.2.1) ω ∈ Ω: ω → X(ω)


is called a random variable [on Ω].

The term “random variable” is well established and so we will use it in
this book but “chance variable” or “stochastic variable” would have been
good too. The adjective “random” is just to remind us that we are dealing
with a sample space and trying to describe certain things which are commonly
called “random events” or “chance phenomena.” What might be said to have
an element of randomness in X(ω) is the sample point ω which is picked “at
random,” such as in a throw of dice or the polling of an individual from a
population. Once ω is picked X(ω) is thereby determined and there is nothing
vague, indeterminate or chancy about it anymore. For instance after an apple
ω is picked from a bushel its weight W(ω) can be measured and may be
considered as known. In this connection the term “variable” should also be
understood in the broad sense as a “dependent variable,” namely a function
of ω, as discussed in §4.1. We can say that the sample point ω serves here as
an “independent variable” in the same way the variable x in sin x does, but
it is better not to use this language since “independent” has a very different
and more important meaning in probability theory (see §5.5).

Finally it is a custom (not always observed) to use a capital letter to 
denote a random variable, such as X, Y, N or S, but there is no reason why 
we cannot use small letters x or y as we did in the Examples of §4.1. 

Observe that random variables can be defined on a sample space before 
any probability is mentioned. Later we shall see that they acquire their prob- 
ability distributions through a probability measure imposed on the space. 

Starting with some random variables, we can at once make new ones by 
operating on them in various ways. Specific examples have already been given 
in the examples in §4.1. The general proposition may be stated as follows: 


Proposition 1. If X and Y are random variables, then so are


(4.2.2) X + Y, X − Y, XY, X/Y (Y ≠ 0),


and aX + bY where a and b are two numbers.

This is immediate from the general definition, since, e.g.,


ω → X(ω) + Y(ω)


is a function on Ω as well as X and Y. The situation is exactly the same as
in calculus: if f and g are functions then so are


f + g, f − g, fg, f/g (g ≠ 0), af + bg.


The only difference is that in calculus these are functions of x, a real number;
while here the functions in (4.2.2) are those of ω, a sample point. Also, as in
calculus where a constant is regarded as a very special kind of function, so is
a constant a very special kind of random variable. For example it is quite
possible that in a class in elementary school, all the pupils are of the same
age. Then the random variable A(ω) discussed in Example 1 of §4.1 is equal
to a constant, say = 9 (years) in a fourth grade class.

In calculus a function of a function is still a function such as x →
log (sin x) or x → f(φ(x)) = (f ∘ φ)(x). A function of a random variable is
still a random variable such as the cos θ in Example 2 of §4.1. More generally
we can have a function of several random variables.


Proposition 2. If φ is a function of two (ordinary) variables and X and Y are
random variables, then


(4.2.3) ω → φ(X(ω), Y(ω))


is also a random variable, which is denoted more concisely as φ(X, Y).

A good example is the function $\varphi(x, y) = \sqrt{x^2 + y^2}$. Suppose X(ω) and
Y(ω) denote respectively the horizontal and vertical velocities of a gas
molecule, then


$\varphi(X, Y) = \sqrt{X^2 + Y^2}$


will denote its absolute speed.
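Proposition 2 amounts to composing functions, and this can be written down literally. The following is a sketch under our own conventions: a sample point is taken to be a pair (horizontal velocity, vertical velocity), and the helper compose is a name we introduce for the purpose.

```python
import math


def compose(phi, *random_variables):
    """Return the random variable w -> phi(X1(w), ..., Xk(w))."""

    def new_variable(w):
        return phi(*(rv(w) for rv in random_variables))

    return new_variable


def X(w):
    """Horizontal velocity of the (hypothetical) molecule w."""
    return w[0]


def Y(w):
    """Vertical velocity of the (hypothetical) molecule w."""
    return w[1]


speed = compose(math.hypot, X, Y)           # phi(x, y) = sqrt(x**2 + y**2)
total = compose(lambda x, y: x + y, X, Y)   # phi(x, y) = x + y, as in Proposition 1

molecule = (3.0, 4.0)
print(speed(molecule), total(molecule))
```

Nothing about probability enters here: speed and total are functions of the sample point, built from X and Y exactly as in calculus.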

Let us note in passing that Proposition 2 contains Proposition 1 as a
particular case. For instance if we take φ(x, y) = x + y, then φ(X, Y) = X + Y.
It also contains functions of a single random variable as a particular case
such as f(X). Do you see why? Finally, extension of Proposition 2 to more
than two variables is obvious. A particularly important case is the sum of
n random variables:


(4.2.4) $S_n(\omega) = X_1(\omega) + \cdots + X_n(\omega) = \sum_{k=1}^{n} X_k(\omega).$


For example if X_1, ..., X_n denote the successive outcomes of a throw of a
die, then S_n is the total obtained in n throws. We shall have much to do with
these partial sums S_n.
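For the die the partial sums can be built exactly as in (4.2.4). This is a small sketch in which the sample point is the n-tuple of outcomes, as in (4.1.3); the function names sample_space, X and S are our own.

```python
from itertools import product


def sample_space(n):
    """All 6**n sample points for n throws of a die, each an n-tuple."""
    return list(product(range(1, 7), repeat=n))


def X(k):
    """The k-th coordinate function (k = 1, ..., n): outcome of the k-th throw."""
    return lambda w: w[k - 1]


def S(n):
    """The partial sum S_n = X_1 + ... + X_n as a function of the sample point."""
    return lambda w: sum(X(k)(w) for k in range(1, n + 1))


omega_space = sample_space(2)
S2 = S(2)

# The set {S_2 = 7}, written out as a list of sample points.
total_is_seven = [w for w in omega_space if S2(w) == 7]
print(len(omega_space), total_is_seven)
```

Note that no probability has been mentioned yet; the set {S_2 = 7} is simply a subset of the 36 points.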

We will now illustrate the uses of random variables in some everyday
situations. Quite often the intuitive notion of some random quantity precedes
that of a sample space. Indeed one can often talk about random variables X,
Y, etc. without bothering to specify Ω. The rather formal (and formidable?)
mathematical set-up serves as a necessary logical backdrop, but it need not
be dragged into the open on every occasion when the language of probability
can be readily employed.


Example 4. The cost of manufacturing a certain book is $3 per book up to
1000 copies, $2 per copy between 1000 and 5000 copies, and $1 per copy
afterwards. In reality of course books are printed in round lots and not on
demand “as you go.” What we assume here is tantamount to selling all
overstock at cost, with no loss of business due to understock. Suppose we print
1000 copies initially and price the book at $5. What is “random” here is the
number of copies that will be sold; call it X. It should be evident that once X
is known, we can compute the profit or loss from the sales; call this Y. Thus
Y is a function of X and is random only because X is so. The formula
connecting Y with X is given below (see Fig. 18):


      5X − 3000              if X ≤ 1000,
Y  =  2000 + 3(X − 1000)     if 1000 < X ≤ 5000,
      14000 + 4(X − 5000)    if X > 5000.


What is the probability that the book is a financial loss? It is that of the event
represented by the set


{5X − 3000 < 0} = {X < 600}.


What is the probability that the profit will be at least $10000? It is that of
the set


{2000 + 3(X − 1000) ≥ 10000} ∪ {X > 5000}
= {X ≥ 8000/3 + 1000} ∪ {X > 5000}
= {X ≥ 3667}.


But what are these probabilities? They will depend on a knowledge of X.
One can only guess at it in advance; so it is a random phenomenon. But after
the sales are out, we shall know the exact value of X; just as after a die is
cast we shall know the outcome. The various probabilities are called the
distribution of X and will be discussed in §4.3 below.
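The two thresholds 600 and 3667 can be confirmed by evaluating the profit formula directly. The sketch below codes Y as a function of X; the name profit is ours, and inclusive end points at 1000 and 5000 are assumed (the neighbouring pieces of the formula agree there, so the choice does not affect the values).

```python
def profit(x):
    """Profit Y in dollars when x copies are sold, following the three-piece formula."""
    if x <= 1000:
        return 5 * x - 3000
    if x <= 5000:
        return 2000 + 3 * (x - 1000)
    return 14000 + 4 * (x - 5000)


# Smallest number of sales that avoids a loss, and smallest giving at least $10000.
break_even = min(x for x in range(0, 10001) if profit(x) >= 0)
ten_thousand = min(x for x in range(0, 10001) if profit(x) >= 10000)
print(break_even, ten_thousand)
```

The search over 0 to 10000 copies is an arbitrary range chosen for the illustration; since the profit increases with x, the two minima determine the sets in question completely.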

What is the sample space here? Since the object of primary interest is X,
we may very well take it as our sample point and call it ω instead to conform
with our general notation. Then each ω is some positive integer and ω → Y(ω)
is a random variable with Ω the space of positive integers. To pick an ω means
in this case to hazard a guess (or make a hypothesis) on the number of sales,
from which we can compute the profit by the preceding formula. There is
nothing wrong about this model of a sample space, though it seems a bit
superfluous.


[Figure 18: graph of Y(ω) against X(ω), with 1000 and 5000 marked on the
horizontal axis and 14,000 on the vertical axis]

A more instructive way of thinking is to consider each ω as representing 
“a possible sales-record for the book.” A publisher is sometimes interested 
in other information than the total number of sales. An important factor 
which has been left out of consideration above is the time element involved 
in the sale. Surely it makes a difference whether 5000 copies are sold in one 
or ten years. If the book is a college text like this one, it may be important 
to know how it does in different types of schools and in different regions of 
the country. If it is fiction or drama it may mean a great deal (even only 
from the profit motive) to know what the critics say about it, though this 
would be in a promotions rather than sales record. All these things may be 
contained in a capsule which is the sample point ω. You can imagine it to 
be a complete record of every bit of information pertaining to the book, of 
which X(ω) and Y(ω) represent only two facets. Then what is Ω? It is the 
totality of all such conceivable records. This concept of a sample space may 
sound weird and is unwieldy (can we say that Ω is countable?), but it gives 
the appropriate picture when one speaks of e.g. the path of a particle in 
Brownian motion or the evolution of a stochastic process (see Chapter 8). 
On the other hand, it also shows the expediency of working with some specific 
random variables rather than worrying about the whole universe. 
Example 5. An insurance company receives claims for indemnification from
time to time. Both the times of arrival of such claims and their amounts are
unknown in advance and determined by chance; ergo, random. The total
amount of claims in one year, say, is of course also random, but clearly it
will be determined as soon as we know the “when” and “how much” of the
claims. Let the claims be numbered as they arrive and let S_n denote the date
of the nth claim. Thus S_3 = 33 means the third claim arrives on February
the second. So we have


1 ≤ S_1 ≤ S_2 ≤ ⋯ ≤ S_n ≤ ⋯


and there is equality whenever several claims arrive on the same day. Let
the amount of the nth claim be C_n (in dollars). What is the total number of
claims received in the year? It is given by N where


N = max {n | S_n ≤ 365}.


Obviously N is also random but it is determined by the sequence of S_n’s; in
theory we need to know the entire sequence because N may be arbitrarily
large. Knowing N, and the sequence of C_n’s, we can determine the total
amount of claims in that year:


(4.2.5) $\sum_{n=1}^{N} C_n = C_1 + \cdots + C_N$


in the notation of (4.2.4). Observe that in (4.2.5) not only each term on the
right side is a random variable but also the number of terms. Of course the
sum is also a random variable. It depends on the S_n’s as well as the C_n’s.
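A made-up ledger shows how N and the yearly total are read off from the dates and amounts. Everything in this sketch is hypothetical: the six claims, the choice to count dates in days from January 1, and the helper names.

```python
from bisect import bisect_right

# Hypothetical ledger, in order of arrival: (date S_n in days, amount C_n in dollars).
claims = [(12, 250.0), (33, 1200.0), (33, 80.0), (200, 640.0), (365, 75.0), (371, 900.0)]
S = [date for date, _ in claims]      # nondecreasing, with ties on day 33
C = [amount for _, amount in claims]


def number_of_claims(dates, horizon=365):
    """N = max{n : S_n <= horizon}; valid because the dates are nondecreasing."""
    return bisect_right(dates, horizon)


def total_claims(dates, amounts, horizon=365):
    """The sum C_1 + ... + C_N, in which the number of terms N itself depends on the dates."""
    n = number_of_claims(dates, horizon)
    return sum(amounts[:n])


print(number_of_claims(S), total_claims(S, C))
```

Changing a single date in the ledger can change N, and with it the number of terms in the sum, which is the point of the remark about (4.2.5).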

In this case we can easily imagine that the claims arrive at the office one 
after another and a complete record of them is kept in a ledger printed 
like a diary. Under some dates there may be no entry, under others there may 
be many in various different amounts. Such a ledger is kept over the years 
and will look quite different from one period of time to another. Another 
insurance company will have another ledger which may be similar in some 
respects and different in others. Each conceivable account kept in such a 
ledger may be considered as a sample point, and a reasonably large collection 
of them may serve as the sample space. For instance, an account in which 
one million claims arrive on the same day may be left out of the question, 
or a claim in the amount of 95 cents. In this way we can keep the image of 
a sample space within proper bounds of realism. 

If we take such a view, other random variables come easily to mind. For
example, we may denote by Y_k the total amount of claims on the kth day.
This will be the number which is the sum of all the entries under the date,
possibly zero. The total claims from the first day of the account to the nth
day can then be represented by the sum


(4.2.6) $Z_n = \sum_{k=1}^{n} Y_k = Y_1 + Y_2 + \cdots + Y_n.$


The total claims over any period of time [s, t] can then be represented as


(4.2.7) $Z_t - Z_{s-1} = \sum_{k=s}^{t} Y_k = Y_s + Y_{s+1} + \cdots + Y_t.$


We can plot the accumulative amount of claims Z_t against the time t by a
graph of the following kind.


[Figure 19: graph of Z(t) against t]


There is a jump at t when the entry for the tth day is not empty, and the
size of the rise is the total amount of claims on that day. Thus the successive
rises correspond to the \(Y_k\)’s which are greater than 0. Clearly you can read
off from such a graph the total claim over any given period of time, and 
also e.g. the lengths of the “free periods” between the claims, but you can- 
not tell what the individual claims are when several arrive on the same day. 
If all the information you need can be got from such a graph then you 
may regard each conceivable graph as a sample point. This will yield a 
somewhat narrower sample space than the one described above, but it will 
serve our purpose. From the mathematical point of view, the identifi- 
cation of a sample point with a graph (also called a sample curve, path, 
or trajectory) is very convenient, since a curve is a more precise (and familiar!) 
object than a ledger or some kind of sales-and-promotions-record. 


4.3. Distribution and expectation 
In Chapter 2 we discussed the probabilities of sets of sample points. These
sets are usually determined by the values of random variables. A typical
example is


(4.3.1)  \(\{a \le X \le b\} = \{\omega \mid a \le X(\omega) \le b\}\)

where X is a random variable, a and b are two constants. Particular cases of
this have been indicated among the examples in §4.1. Since every subset of \(\Omega\)
has a probability assigned to it when \(\Omega\) is countable, the set above has a
probability which will be denoted by


(4.3.2)  \(P(a \le X \le b).\)


More generally let A be a set of real numbers, alias a set of points on the
real line \(R^1 = (-\infty, +\infty)\); then we can write


(4.3.3)  \(P(X \in A) = P(\{\omega \mid X(\omega) \in A\}).\)


For instance, if A is the closed interval [a, b] then this is just the set in (4.3.2);
but A may be the open interval (a, b), half-open interval (a, b] or [a, b),
infinite intervals \((-\infty, b)\) or \((a, +\infty)\); the union of several intervals, or a set
of integers say {m, m + 1, ..., m + n}. An important case occurs when A
reduces to a single point x; it is then called the singleton {x}. The distinction
between the point x and the set {x} may seem academic. Anyway, the
probability


(4.3.4)  \(P(X = x) = P(X \in \{x\})\)


is “the probability that X takes (or assumes) the value x.” If X is the age of
a human population, {X = 18} is the subpopulation of 18-year-olds—a very 
important set! 

Now the hypothesis that \(\Omega\) is countable will play an essential simplifying
role. Since X has \(\Omega\) as domain of definition, it is clear that the range of X
must be finite when \(\Omega\) is finite, and at most countably infinite when \(\Omega\) is so.
Indeed the exact range of X is just the set of real numbers below:


(4.3.5)  \(V_X = \bigcup_{\omega \in \Omega} \{X(\omega)\},\)


and many of these numbers may be the same. For the mapping \(\omega \to X(\omega)\) is
in general many-to-one, not necessarily one-to-one. In the extreme case when
X is a constant random variable, the set \(V_X\) reduces to a single number. Let
the distinct values in \(V_X\) be listed in any order as


\(\{v_1, v_2, \ldots, v_n, \ldots\}.\)


The sequence may be finite or infinite. Clearly if \(x \notin V_X\), namely if x is not
one of the values \(v_n\), then P(X = x) = 0. On the other hand, we do not forbid
that some \(v_n\) may have zero probability. This means that some sample points
may have probability zero. You may object: why don’t we throw such nuisance
points out of the sample space? Because it is often hard to know in advance
which ones to throw out. It is easier to leave them in since they do no harm.
[In an uncountable \(\Omega\), every single point \(\omega\) may have probability zero! But
we are not talking about this at present; see §4.5 below.]

Let us introduce the notation


(4.3.6)  \(p_n = P(X = v_n), \quad v_n \in V_X.\)


It should be obvious that if we know all the \(p_n\)’s, then we can calculate all
probabilities concerning the random variable X alone. Thus, the probabilities
in (4.3.2) and (4.3.3) can be expressed in terms of the \(p_n\)’s as follows:


(4.3.7)  \(P(a \le X \le b) = \sum_{a \le v_n \le b} p_n; \quad P(X \in A) = \sum_{v_n \in A} p_n.\)


The first is a particular case of the second, and the last-written sum reads
this way: “the sum of the \(p_n\)’s for which the corresponding \(v_n\)’s belong to A.”

When A is the infinite interval \((-\infty, x]\) for any real number x, we can
introduce a function of x as follows:


(4.3.8)  \(F_X(x) = P(X \le x) = \sum_{v_n \le x} p_n.\)


This function \(x \to F_X(x)\) defined on \(R^1\) is called the distribution function of X.
Its value at x “picks up” all the probabilities of values of X up to x (inclusive);
for this reason the adjective “cumulative” is sometimes added to its name. For
example if X is the annual income (in $’s) of a breadwinner, then \(F_X(10000)\)
is the probability of the income group earning anywhere up to ten thousand,
and can theoretically include all those whose incomes are negative.

The distribution function \(F_X\) is determined by the \(v_n\)’s and \(p_n\)’s as shown
in (4.3.8). Conversely if we know \(F_X\), namely we know \(F_X(x)\) for all x, we can
“recover” the \(v_n\)’s and \(p_n\)’s. We will not prove this fairly obvious assertion
here. For the sake of convenience, we shall say that the two sets of numbers
\(\{v_n\}\) and \(\{p_n\}\) determine the probability distribution of X, where any \(v_n\) for
which \(p_n = 0\) may be omitted. It is easy to see that if a < b, then


(4.3.9)  \(P(a < X \le b) = P(X \le b) - P(X \le a) = F_X(b) - F_X(a);\)


but how do we get \(P(a \le X \le b)\), or \(P(X = x)\) from \(F_X\)? (see Exercise 7
below).
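
As an illustration of (4.3.8) and (4.3.9), here is a small Python sketch for a finite distribution. The distribution (a fair die) is chosen only for illustration and the helper name `cdf` is hypothetical; exact fractions are used so that the two sides of (4.3.9) can be compared without rounding.

```python
from fractions import Fraction

# Illustrative distribution: a fair die, v_n = 1, ..., 6 and p_n = 1/6.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6


def cdf(x):
    """F_X(x): the sum of p_n over all n with v_n <= x, as in (4.3.8)."""
    return sum((p for v, p in zip(values, probs) if v <= x), Fraction(0))


# (4.3.9): P(a < X <= b) = F_X(b) - F_X(a), checked against a direct sum.
a, b = 2, 5
direct = sum((p for v, p in zip(values, probs) if a < v <= b), Fraction(0))
print(cdf(b) - cdf(a), direct)
```

Note that the left endpoint a is excluded and the right endpoint b is included, exactly as in (4.3.9).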

Now to return to the \(p_n\)’s, which are sometimes called the “elementary
probabilities” for the random variable X. In general they have the following
two properties:


(4.3.10)  (i) \(\forall n\colon p_n \ge 0\),
          (ii) \(\sum_n p_n = 1\).


Compare this with (2.4.1). The sum in (ii) may be over a finite or infinite
sequence according as \(V_X\) is a finite or infinite set. The property (i) is obvious,
apart from the observation already made that some \(p_n\) may = 0. The property
(ii) says that the values \(\{v_n\}\) in \(V_X\) exhaust all possibilities for X, hence their
probabilities must add up to that of the “whole universe.” This is a fine way
to say things, but let us learn to be more formal by converting the verbal
argument into a symbolic proof. We begin with


\(\bigcup_n \{X = v_n\} = \Omega.\)


Since the \(v_n\)’s are distinct, the sets \(\{X = v_n\}\) must be disjoint. Hence by
countable additivity (see §2.3) we have


\(\sum_n P(X = v_n) = P(\Omega) = 1.\)


This is property (ii). 

Before making further specialization on the random variables, let us for- 
mulate a fundamental new definition in its full generality. It is motivated by 
the intuitive notion of the average of a random quantity. 


Definition of Mathematical Expectation. For a random variable X defined on
a countable sample space \(\Omega\), its mathematical expectation is the number E(X)
given by the formula


(4.3.11)  \(E(X) = \sum_{\omega \in \Omega} X(\omega) P(\{\omega\}),\)

provided that the series converges absolutely, namely


(4.3.12)  \(\sum_{\omega \in \Omega} |X(\omega)| P(\{\omega\}) < \infty.\)


In this case we say that the mathematical expectation of X exists. The process
of “taking expectations” may be described in words as follows: take the value
of X at each \(\omega\), multiply it by the probability of that point, and sum over
all \(\omega\) in \(\Omega\). If we think of \(P(\{\omega\})\) as the weight attached to \(\omega\) then E(X) is
the weighted average of the function X. Note that if we label the \(\omega\)’s as
\(\{\omega_1, \omega_2, \ldots, \omega_n, \ldots\}\), then we have


\(E(X) = \sum_n X(\omega_n) P(\{\omega_n\}).\)

But we may as well use \(\omega\) itself as label, and save a subscript, which explains
the cryptic notation in (4.3.11).


Example 6. Let \(\Omega = \{\omega_1, \ldots, \omega_7\}\) be a parcel of land sub-divided into seven
“lots for sale.” These lots have the following percentage areas and prices per acre:


5%, 10%, 10%, 10%, 15%, 20%, 30%;
$800, $900, $1000, $1200, $800, $900, $800.

Define \(X(\omega)\) to be the price per acre of the lot \(\omega\). Then E(X) is the average
price per acre of the whole parcel and is given by 


\((800)\frac{5}{100} + (900)\frac{10}{100} + (1000)\frac{10}{100} + (1200)\frac{10}{100} + (800)\frac{15}{100} + (900)\frac{20}{100} + (800)\frac{30}{100} = 890;\)


namely $890 per acre. This can also be computed by first lumping together
all acreage at the same price, and then summing over the various prices:


\((800)\left(\frac{5}{100} + \frac{15}{100} + \frac{30}{100}\right) + (900)\left(\frac{10}{100} + \frac{20}{100}\right) + (1000)\frac{10}{100} + (1200)\frac{10}{100}\)

\(\quad = (800)\frac{50}{100} + (900)\frac{30}{100} + (1000)\frac{10}{100} + (1200)\frac{10}{100} = 890.\)
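
The two computations of Example 6 can be replayed in a few lines of Python. This is only a sketch: the variable names are ours, and exact fractions are used so that both routes give exactly the same number.

```python
from collections import defaultdict
from fractions import Fraction

# Data of Example 6: area fraction P({w}) and price per acre X(w) of each lot.
areas = [Fraction(k, 100) for k in (5, 10, 10, 10, 15, 20, 30)]
prices = [800, 900, 1000, 1200, 800, 900, 800]

# First route, definition (4.3.11): sum X(w) P({w}) over all sample points w.
by_points = sum(x * p for x, p in zip(prices, areas))

# Second route: lump together all acreage at the same price,
# then sum over the distinct prices.
area_at_price = defaultdict(Fraction)
for x, p in zip(prices, areas):
    area_at_price[x] += p
by_values = sum(x * p for x, p in area_at_price.items())

print(by_points, by_values)
```

The dictionary `area_at_price` is nothing but the probability distribution of X: each distinct price together with the total area sold at that price.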


The adjective in ‘‘mathematical expectation” is frequently omitted, and 
it is also variously known as “expected value,” “‘mean (value)’’ or ‘“‘first 
moment”’ (see §6.3 for the last term). In any case, do not expect the value 
E(X) when X is observed. For example, if you toss a fair coin to win $1 or 
nothing according as it falls heads or tails, you will never get the expected 
value $.50! However, if you do this n times and n is large, then you can
expect to get about n/2 dollars with a good probability. This is the implica- 
tion of the Law of Large Numbers, to be made precise in §7.6. 

We shall now amplify on the condition given in (4.3.12). Of course it is
automatically satisfied when \(\Omega\) is a finite space, but it is essential when \(\Omega\) is
countably infinite. For it allows us to calculate the expectation in any old
way by rearranging and regrouping the terms in the series in (4.3.11), without
fear of getting contradictory results. In other words, if the series is absolutely
convergent, then it has a uniquely defined “sum” which in no way depends
on how the terms are picked out and added together. The fact that contradictions
can indeed arise if this condition is dropped may be a surprise to
you. If so, you will do well to review your knowledge of the convergence and
absolute convergence of a numerical series. This is a part of the calculus course
which is often poorly learned (and taught), but will be essential for probability
theory, not only in this connection but generally speaking. Can you,
for instance, think of an example where the series in (4.3.11) converges but
the one in (4.3.12) does not? [Remember that the \(p_n\)’s must satisfy the conditions
in (4.3.10), though the \(v_n\)’s are quite arbitrary. So the question is a
little harder than just to find an arbitrary non-absolutely convergent series;
but see Exercise 21.] In such a case the expectation is not defined at all. The
reason why we are being so strict is: absolutely convergent series can be
manipulated in ways that non-absolutely [conditionally] convergent series
cannot be. Surely the definition of E(X) would not make sense if its value
could be altered simply by shuffling around the various terms in the series in
(4.3.11), which merely means that we enumerate the sample points in a different
way. Yet this can happen without the condition (4.3.12)!
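
If the danger of rearrangement seems abstract, the following Python sketch may help. It is not an expectation but a plain numerical series, the alternating harmonic series 1 − 1/2 + 1/3 − ⋯, which converges but not absolutely. Summed in its natural order it approaches log 2; taking the same terms as two positive ones followed by one negative one, it approaches (3/2) log 2, a standard calculus fact. The function names are ours.

```python
import math


def natural_order(blocks):
    """Partial sum of 1 - 1/2 + 1/3 - 1/4 + ... over 2*blocks terms."""
    return sum((-1) ** (k + 1) / k for k in range(1, 2 * blocks + 1))


def rearranged(blocks):
    """The same terms, taken as two positive terms, then one negative."""
    total = 0.0
    for j in range(1, blocks + 1):
        total += 1 / (4 * j - 3) + 1 / (4 * j - 1)  # next two odd reciprocals
        total -= 1 / (2 * j)                         # next even reciprocal
    return total


print(natural_order(10_000), math.log(2))
print(rearranged(10_000), 1.5 * math.log(2))
```

Every term of the series is used exactly once in either order, yet the two limits differ. This is the behavior that condition (4.3.12) rules out.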

Let us state explicitly a general method of calculating E(X) which is
often expedient. Suppose the sample space \(\Omega\) can be decomposed into disjoint
sets \(A_n\):


(4.3.13)  \(\Omega = \bigcup_n A_n\)


in such a way that X takes the same value on each \(A_n\). Thus we may write
(4.3.14)  \(X(\omega) = a_n\) for \(\omega \in A_n\),
where the \(a_n\)’s need not be all different. We have then


(4.3.15)  \(E(X) = \sum_n P(A_n) a_n.\)


This is obtained by regrouping the \(\omega\)’s in (4.3.11) first into the subsets \(A_n\),
and then summing over all n. In particular if \((v_1, v_2, \ldots, v_n, \ldots)\) is the range
of X, and we group the sample points \(\omega\) according to the values of \(X(\omega)\),
i.e., putting


\(A_n = \{\omega \mid X(\omega) = v_n\}, \quad P(A_n) = p_n,\)


then we get


(4.3.16)  \(E(X) = \sum_n p_n v_n,\)


where the series will automatically converge absolutely because of (4.3.12). 
In this form it is clear that the expectation of X is determined by its proba- 
bility distribution. 

Finally, it is worthwhile to point out that the formula (4.3.11) contains
an expression for the expectation of any function of X:


\(E(\varphi(X)) = \sum_{\omega \in \Omega} \varphi(X(\omega)) P(\{\omega\})\)


with a proviso like (4.3.12). For by Proposition 2 or rather a simpler analogue,
\(\varphi(X)\) is also a random variable. It follows that we have


(4.3.17)  \(E(\varphi(X)) = \sum_n p_n \varphi(v_n),\)


where the \(v_n\)’s are as in (4.3.16), but note that the \(\varphi(v_n)\)’s need not be distinct.
Thus the expectation of \(\varphi(X)\) is already determined by the probability distribution
of X (and of course also by the function \(\varphi\)), without the intervention
of the probability distribution of \(\varphi(X)\) itself. This is most convenient in calculations.
In particular, for \(\varphi(x) = x^r\) we get the rth moment of X:

(4.3.18)  \(E(X^r) = \sum_n p_n v_n^r,\)


see §6.3. 
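
Here is a brief Python sketch of the last point: the expectation of a function of X can be computed from the distribution of X alone. The distribution below is made up for illustration, and the names `expect` and `moment` are hypothetical helpers.

```python
from collections import defaultdict
from fractions import Fraction

# Illustrative distribution of X: values v_n mapped to probabilities p_n.
dist_x = {
    -2: Fraction(1, 4),
    -1: Fraction(1, 4),
    1: Fraction(1, 4),
    2: Fraction(1, 4),
}


def expect(phi, dist):
    """E(phi(X)) as the sum of p_n * phi(v_n), in the manner of (4.3.17)."""
    return sum(p * phi(v) for v, p in dist.items())


def moment(r, dist):
    """r-th moment E(X^r): the special case phi(x) = x^r."""
    return expect(lambda x: x ** r, dist)


# The longer route: first build the distribution of X^2, whose values
# 1 and 4 each arise from two different v_n, then take its expectation.
dist_sq = defaultdict(Fraction)
for v, p in dist_x.items():
    dist_sq[v * v] += p

print(moment(1, dist_x), moment(2, dist_x), expect(lambda y: y, dist_sq))
```

The second moment obtained directly from `dist_x` agrees with the mean of the separately constructed distribution `dist_sq`, so the intermediate distribution is not needed.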


4.4. Integer-valued random variables 


In this section we consider random variables which take only nonnegative 
integer values. In this case it is convenient to consider the range to be the 
entire set of such numbers: 


\(N^0 = \{0, 1, 2, \ldots, n, \ldots\}\)


since we can assign probability zero to those which are not needed. Thus we
have, as specialization of (4.3.6), (4.3.8) and (4.3.11):


         \(p_n = P(X = n), \quad n \in N^0;\)
(4.4.1)  \(F_X(x) = \sum_{0 \le n \le x} p_n;\)
         \(E(X) = \sum_{n=0}^{\infty} n p_n.\)


Since all terms in the last-written series are nonnegative, there is no difference
between convergence and absolute convergence. Furthermore, since such a
series either converges to a finite sum or diverges to \(+\infty\), we may even allow
\(E(X) = +\infty\) in the latter case. This is in contrast to our general definition in
the last section, but is a convenient extension.

In many problems there is practical justification to consider the random 
variables to take only integer values, provided a suitably small unit of meas- 
urement is chosen. For example, monetary values can be expressed in cents 
rather than dollars, or one tenth of a cent if need be; if “inch” is not a small 
enough unit for lengths we can use one hundredth or one thousandth of an 
inch. There is a unit called angstrom (Å) which is equal to \(10^{-7}\) of a millimeter,
used to measure electromagnetic wavelengths. For practical purposes, of
course, incommensurable magnitudes (irrational ratios) do not exist; at one
time \(\pi\) was legally defined to be 3.14 in some state of the United States! But
one can go too far in this kind of justification! 

We proceed to give some examples of (4.4.1). 


Example 7. Suppose L is a positive integer, and


(4.4.2)  \(p_1 = p_2 = \cdots = p_L = \frac{1}{L}.\)


Then automatically all other \(p_n\)’s must be zero because \(\sum_{n=1}^{L} p_n = L \cdot \frac{1}{L} = 1\)
and the conditions in (4.3.10) must be satisfied. Next, we have

\(E(X) = \sum_{n=1}^{L} n \cdot \frac{1}{L} = \frac{1}{L} \cdot \frac{L(L+1)}{2} = \frac{L+1}{2}.\)


The sum above is done by a formula for arithmetical progression which you 
have probably learned in school. 

We say in this case that X has a uniform distribution over the set 
{1, 2, ..., L}. In the language of Chapter 3, the L possible cases {X = 1},
{X = 2}, ..., {X = L} are all equally likely. The expected value of X is 
equal to the arithmetical mean [average] of the L possible values. Here is an 
illustration of its meaning. Suppose you draw at random a token X from a 
box containing 100 tokens valued at 1¢ to 100¢. Then your expected prize is 
given by E(X) = 50.5¢. Does this sound reasonable to you? 


Example 8. Suppose you toss a perfect coin repeatedly until a head turns 
up. Let X denote the number of tosses it takes until this happens, so that 
{X = n} means n − 1 tails before the first head. It follows from the discussion
in Example 8 of §2.4 that


(4.4.3)  \(p_n = P(X = n) = \frac{1}{2^n}\)


because the favorable outcome is just the specific sequence TT⋯TH, with T
repeated n − 1 times. What is the expectation of X? According to (4.4.1), it is
given by the formula


(4.4.4)  \(\sum_{n=1}^{\infty} \frac{n}{2^n} = \,?\)


Let us learn how to sum this series, though properly speaking this does not 
belong to this course. We begin with the fountainhead of many of such series: 


(4.4.5)  \(\frac{1}{1 - x} = 1 + x + x^2 + \cdots + x^n + \cdots = \sum_{n=0}^{\infty} x^n\)  for |x| < 1.


This is a geometric series of the simplest kind which you surely have seen.
Now differentiate it term by term:


(4.4.6)  \(\frac{1}{(1 - x)^2} = 1 + 2x + 3x^2 + \cdots + n x^{n-1} + \cdots = \sum_{n=0}^{\infty} (n + 1) x^n\)  for |x| < 1.


This is valid because the radius of convergence of the power series in (4.4.5)
is equal to 1, so such manipulations are legitimate for |x| < 1. [Absolute and
uniform convergence of the power series is involved here.] If we substitute
x = 1/2 in (4.4.6) we obtain

(4.4.7)  \(4 = \sum_{n=0}^{\infty} (n + 1) \left(\frac{1}{2}\right)^n.\)


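There is still some difference between (4.4.4) and the series above, so a little
algebraic manipulation is needed. One way is to split up the terms above:


\(4 = \sum_{n=0}^{\infty} \frac{n}{2^n} + \sum_{n=0}^{\infty} \frac{1}{2^n} = \sum_{n=1}^{\infty} \frac{n}{2^n} + 2,\)


where we have summed the second series by substituting x = 1/2 into (4.4.5).
Thus the answer to (4.4.4) is equal to 2. Another way to manipulate the
formula is to change the index of summation: \(n + 1 = \nu\). Then we have


\(4 = \sum_{n=0}^{\infty} \frac{n + 1}{2^n} = \sum_{\nu=1}^{\infty} \frac{\nu}{2^{\nu - 1}} = 2 \sum_{\nu=1}^{\infty} \frac{\nu}{2^{\nu}},\)


which of course yields the same answer. Both techniques are very useful!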

The expectation E(X) = 2 seems eminently fair on intuitive grounds. For
if the probability of your obtaining a head is 1/2 on one toss, then two tosses
should get you 2 · 1/2 = 1 head, on the average. This plausible argument
[which was actually given in a test paper by a smart student] can be made
rigorous, but the necessary reasoning involved is far more sophisticated than
you might think. It is a case of Wald’s equation† or martingale theorem [for
the advanced reader].

Let us at once generalize this problem to the case of a biased coin, with
probability p for head and q = 1 − p for tail. Then (4.4.3) becomes


(4.4.8)  \(p_n = \underbrace{q \cdots q}_{n-1 \text{ times}}\, p = q^{n-1} p,\)


and (4.4.4) becomes


(4.4.9)  \(\sum_{n=1}^{\infty} n q^{n-1} p = p \sum_{n=0}^{\infty} (n + 1) q^n = \frac{p}{(1 - q)^2} = \frac{p}{p^2} = \frac{1}{p}.\)

The random variable X is called the waiting time, for head to fall, or more
generally for a “success.” The distribution \(\{q^{n-1} p;\ n = 1, 2, \ldots\}\) will be called
the geometrical distribution with success probability p.
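
The value 1/p can be checked numerically by summing the series for the expected waiting time term by term. The Python sketch below does so for a few illustrative values of p; the function name and the number of terms are our own choices, and the truncated tail is negligible for these values.

```python
def geometric_mean_partial(p, terms):
    """Partial sum of n * q**(n-1) * p for n = 1, ..., terms, with q = 1 - p."""
    q = 1.0 - p
    total = 0.0
    weight = p  # equals q**(n-1) * p, starting at n = 1
    for n in range(1, terms + 1):
        total += n * weight
        weight *= q
    return total


for p in (0.5, 0.2, 0.05):
    print(p, geometric_mean_partial(p, 2000), 1 / p)
```

For the perfect coin, p = 1/2, the partial sums approach 2, in agreement with Example 8.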


Example 9. A perfect coin is tossed n times. Let \(S_n\) denote the number of
heads obtained. In the notation of §2.4, we have \(S_n = X_1 + \cdots + X_n\). We
know from §3.2 that


(4.4.10)  \(p_k = P(S_n = k) = \frac{1}{2^n} \binom{n}{k}, \quad 0 \le k \le n.\)


† Named after Abraham Wald (1902-1950), leading U.S. statistician.

If we believe in probability, then we know \(\sum_{k=0}^{n} p_k = 1\) from (4.3.10). Hence


(4.4.11)  \(\sum_{k=0}^{n} \frac{1}{2^n} \binom{n}{k} = 1 \quad\text{or}\quad \sum_{k=0}^{n} \binom{n}{k} = 2^n.\)

This has been shown in (3.3.7) and can also be obtained from (4.4.13) below
by putting x = 1 there, but we have done it by an argument based on probability.
Next we have


(4.4.12)  \(E(S_n) = \sum_{k=0}^{n} \frac{k}{2^n} \binom{n}{k}.\)


Here again we must sum a series, a finite one. We will do it in two different 
ways, both useful for other calculations. First by direct manipulation, the 
series may be rewritten as 


\(\sum_{k=0}^{n} \frac{k}{2^n} \frac{n!}{k!\,(n - k)!} = \frac{n}{2^n} \sum_{k=1}^{n} \frac{(n - 1)!}{(k - 1)!\,(n - k)!} = \frac{n}{2^n} \sum_{k=1}^{n} \binom{n - 1}{k - 1}.\)


What we have done above is to cancel k from k!, split off n from n! and omit
a zero term for k = 0. Now change the index of summation by putting
k − 1 = j (we have done this kind of thing in Example 8):


\(\frac{n}{2^n} \sum_{k=1}^{n} \binom{n - 1}{k - 1} = \frac{n}{2^n} \sum_{j=0}^{n-1} \binom{n - 1}{j} = \frac{n}{2^n} \cdot 2^{n-1} = \frac{n}{2},\)

where the step before the last is obtained by using (4.4.11) with n replaced
by n − 1. Hence the answer is n/2.

This method is highly recommended if you enjoy playing with combina- 
torial formulas such as the binomial coefficients. But most of you will prob- 
ably find the next method easier because it is more like a cook-book recipe. 
Start with Newton’s binomial theorem in the form: 


(4.4.13)  \((1 + x)^n = \sum_{k=0}^{n} \binom{n}{k} x^k.\)


Observe that this is just an expression of a polynomial in x and is a special
case of Taylor’s series, just as the series in (4.4.5) and (4.4.6) are. Now differentiate
to get


(4.4.14)  \(n (1 + x)^{n-1} = \sum_{k=0}^{n} \binom{n}{k} k x^{k-1}.\)

Substitute x = 1:

\(n 2^{n-1} = \sum_{k=0}^{n} \binom{n}{k} k;\)

divide through by \(2^n\) and get the answer n/2 again for (4.4.12). So the expected
number of heads in n tosses is n/2. Once more, what could be more reasonable
since heads are expected half of the time!

We can generalize this problem to a biased coin too. Then (4.4.10) 
becomes 


(4.4.15)  \(P(S_n = k) = \binom{n}{k} p^k q^{n-k}, \quad 0 \le k \le n.\)


There is a preview of the above formula in §2.4. We now see that it gives the
probability distribution of the random variable \(S_n = \sum_{i=1}^{n} X_i\). It is called the
binomial distribution B(n; p). The random variable \(X_i\) here as well as its distribution
is often referred to as Bernoullian; and when p = 1/2 the adjective
symmetric is added. Next, (4.4.12) becomes


(4.4.16)  \(\sum_{k=0}^{n} \binom{n}{k} k p^k q^{n-k} = np.\)

Both methods used above still work. The second is quicker: setting \(x = p/q\)
in (4.4.14), we obtain since p + q = 1,


\(n \left(1 + \frac{p}{q}\right)^{n-1} = \frac{n}{q^{n-1}} = \sum_{k=0}^{n} \binom{n}{k} k \left(\frac{p}{q}\right)^{k-1};\)


multiplying through by \(p q^{n-1}\) we establish (4.4.16).
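
For readers who like to see such identities confirmed on a machine, here is a short Python sketch that evaluates the binomial probabilities and their mean exactly. The values n = 10 and p = 3/10 are arbitrary illustrations and the function names are hypothetical; `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p):
    """List of P(S_n = k) = C(n, k) p^k q^(n-k) for k = 0, ..., n."""
    q = 1 - p
    return [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]


def binomial_mean(n, p):
    """Sum of k * P(S_n = k) over k = 0, ..., n."""
    return sum(k * pk for k, pk in enumerate(binomial_pmf(n, p)))


n, p = 10, Fraction(3, 10)
print(sum(binomial_pmf(n, p)), binomial_mean(n, p), n * p)
```

With exact fractions the probabilities add up to exactly 1 and the mean is exactly np, with no rounding error to explain away.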


For another important example of nonnegative integer-valued random
variable having a Poisson distribution, see §7.1.


4.5. Random variables with densities 


In the preceding sections we have given a quite rigorous discussion of random 
variables which take only a countable set of values. But even at an elementary 
level there are many important questions in which we must consider random 
variables not subject to such a restriction. This means that we need a sample 
space which is not countable. Technical questions of “‘measurability” then 
arise which cannot be treated satisfactorily without more advanced mathe- 
matics. As we have mentioned in Chapter 2, this kind of difficulty stems from 
the impossibility of assigning a probability to every subset of the sample 
space when it is uncountable. The matter is resolved by confining ourselves 
to sample sets belonging to an adequate class called a Borel field; see Ap- 
pendix 1. Without going into this here we will take up a particular but very 
important situation which covers most applications and requires little mathe- 
matical abstraction. This is the case of random variable with a “density.” 
Consider a function f defined on \(R^1 = (-\infty, +\infty)\):

\(u \to f(u)\)


and satisfying two conditions:


(4.5.1)  (i) \(\forall u\colon f(u) \ge 0\);
         (ii) \(\int_{-\infty}^{\infty} f(u)\, du = 1\).

Such a function is called a density function on \(R^1\). The integral in (ii) is the
Riemann integral taught in calculus. You may recall that if f is continuous or
just piecewise continuous, then the definite integral


\(\int_a^b f(u)\, du\)


exists for any interval [a, b]. But in order that the “improper integral” over
the infinite range \((-\infty, +\infty)\) should exist, further conditions are needed to
make sure that f(u) is pretty small for large |u|. In general, such a function is
said to be “integrable over \(R^1\).” The requirement that the total integral be
equal to one is less serious than it might appear, because if


\(\int_{-\infty}^{\infty} f(u)\, du = M < \infty,\)


we can just divide through by M and use f/M instead of f. Here are some 
possible pictures of density functions, some smooth, some not so. 


Graphs of density functions 


Figure 20 
You see what a variety they can be. The only constraints are that the curve
should not lie below the x-axis anywhere, that the area under the curve 
should have a meaning, and the total area should be equal to one. You may 
agree that this is not asking for too much. 

We can now define a class of random variables on an arbitrary sample 
space as follows. As in §4.2, X is a function on \(\Omega\): \(\omega \to X(\omega)\), but its probabilities
are prescribed by means of a density function f so that for any interval
[a, b] we have


(4.5.2)  \(P(a \le X \le b) = \int_a^b f(u)\, du.\)


More generally, if A is the union of intervals not necessarily disjoint and
some of which may be infinite, we have


(4.5.3)  \(P(X \in A) = \int_A f(u)\, du.\)


Such a random variable is said to have a density, and its density function is f.
[In some books this is called a “continuous” random variable, whereas the
kind discussed in §2 is called “discrete.” Both adjectives are slightly misleading
so we will not use them here.]

If A is a finite union of intervals, then it can be split up into disjoint ones,
some of which may abut on each other, such as


\(A = \bigcup_{j=1}^{k} [a_j, b_j],\)

and then the right hand side of (4.5.3) may be written as


\(\int_A f(u)\, du = \sum_{j=1}^{k} \int_{a_j}^{b_j} f(u)\, du.\)


This is a property of integrals which is geometrically obvious when you consider
them as areas. Next if \(A = (-\infty, x]\), then we can write


(4.5.4)  \(F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\, du;\)


compare with (4.3.8). This formula defines the distribution function F of X 
as a primitive [indefinite integral] of f. It follows from the fundamental theo- 
rem of calculus that if f is continuous, then f is the derivative of F: 


(4.5.5) F'(x) = f(x). 
Thus in this case the two functions f and F mutually determine each other. If f
is not continuous everywhere, (4.5.5) is still true for every x at which f is
continuous. These things are proved in calculus.

Let us observe that in the definition above of a random variable with a
density, it is implied that the sets \(\{a \le X \le b\}\) and \(\{X \in A\}\) have probabilities
assigned to them, in fact they are specified in (4.5.2) and (4.5.3) by means
of the density function. This is a subtle point in the wording that should be 
brought out, but will not be elaborated on. [Otherwise we shall be getting 
into the difficulties that we are trying to circumvent here. But see Appendix 
1.] Rather, let us remark on the close resemblance between the formulas 
above and corresponding ones in §4.3. This will be amplified by a definition of 
mathematical expectation in the present case and listed below for comparison. 


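                          Countable case                      Density case
Range                     \(v_n,\ n = 1, 2, \ldots\)          \(-\infty < u < +\infty\)
element of probability    \(p_n\)                             \(f(u)\, du = dF(u)\)
\(P(a \le X \le b)\)      \(\sum_{a \le v_n \le b} p_n\)      \(\int_a^b f(u)\, du\)
\(P(X \le x) = F(x)\)     \(\sum_{v_n \le x} p_n\)            \(\int_{-\infty}^{x} f(u)\, du\)
\(E(X)\)                  \(\sum_n p_n v_n\)                  \(\int_{-\infty}^{\infty} u f(u)\, du\)
proviso                   \(\sum_n p_n |v_n| < \infty\)       \(\int_{-\infty}^{\infty} |u| f(u)\, du < \infty\)


More generally, the analogue of (4.3.17) is


(4.5.6)  \(E(\varphi(X)) = \int_{-\infty}^{\infty} \varphi(u) f(u)\, du.\)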


You may ignore the second item in the density case above involving a differ- 
ential if you don’t know what it means. 

Further insight into the analogy is gained by looking at the following 
picture: 


Figure 21 
The curve is the graph of a density function f. We have divided up the x-axis
into m + 1 pieces, not necessarily equal and not necessarily small, and denote
the area under the curve between \(x_n\) and \(x_{n+1}\) by \(p_n\), thus:


\(p_n = \int_{x_n}^{x_{n+1}} f(u)\, du, \quad 1 \le n \le m.\)

It is clear that we have


\(\forall n\colon p_n \ge 0; \quad \sum_n p_n = 1.\)


Hence the numbers \(p_n\) satisfy the conditions in (4.3.10). Instead of a finite
partition we may have a countable one by suitable labeling such as ...,
\(p_{-2}, p_{-1}, p_0, p_1, \ldots\). Thus we can derive a set of “elementary probabilities”
from a density function, in infinitely many ways. This process may be called
discretization. If X has the density f, we may consider a random variable Y
such that


\(P(Y = x_n) = p_n\)


where we may replace \(x_n\) by any other number in the subinterval \([x_n, x_{n+1}]\).
Now if f is continuous and the partition is sufficiently fine, namely if the
pieces are sufficiently small, then it is geometrically evident that Y is in some
sense a discrete approximation of X. For instance


\(E(Y) = \sum_n p_n x_n\)


will be an approximation of \(E(X) = \int_{-\infty}^{\infty} u f(u)\, du\). Remember the Riemann


sums defined in calculus to lead to a Riemann integral? There the strips with 
curved tops in Figure 21 are replaced by flat-tops (rectangles), but the ideas 
involved are quite similar. From a practical point of view, it is the discrete 
approximations that can be really measured, whereas the continuous density 
is only a mathematical idealization. We shall return to this in a moment. 
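
Discretization is easy to try out numerically. In the Python sketch below we assume, purely for illustration, the density f(u) = 2u on [0, 1] and zero elsewhere, so that the area of each strip is known in closed form and E(X) = 2/3. The interval [0, 1] is cut into m equal pieces and Y takes the left endpoint of each piece; the function name is ours.

```python
def discretized_mean(m):
    """E(Y) when Y takes the left endpoint x_n with probability p_n.

    Assumed density: f(u) = 2u on [0, 1], zero elsewhere, so the area
    under f between x_n and x_{n+1} is x_{n+1}**2 - x_n**2.
    """
    total = 0.0
    for n in range(m):
        left, right = n / m, (n + 1) / m
        p_n = right ** 2 - left ** 2  # area of the n-th strip
        total += p_n * left
    return total


exact = 2 / 3  # E(X): the integral of u * 2u over [0, 1]
for m in (4, 16, 64, 256):
    print(m, discretized_mean(m), exact - discretized_mean(m))
```

As the partition is refined the gap between E(Y) and E(X) shrinks. Since Y always sits at the left end of its piece, E(Y) approaches E(X) from below in this sketch.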

Having dwelled on the similarity of the two cases of random variable, we 
will pause to stress a fundamental difference between them. If X has a density, 
then by (4.5.2) with a = b = x, we have 


(4.5.7)  \(P(X = x) = \int_x^x f(u)\, du = 0.\)


Geometrically speaking, this merely states the trivial fact that a line segment 
has zero area. Since x is arbitrary in (4.5.7), it follows that X takes any pre- 
assigned value with probability zero. This is in direct contrast to a random 
variable taking a countable set of values, for then it must take some of these 
values with positive probability. It seems paradoxical that on the one hand, 
X(w) must be some number for every w, and on the other hand any given number 
has probability zero. The following simple concrete example should clarify
this point. 


Example 10. Spin a needle on a circular dial. When it stops it points at a
random angle \(\theta\) (measured from the horizontal, say). Under normal conditions
it is reasonable to suppose that \(\theta\) is uniformly distributed between 0° and
360° (cf. Example 7 of §4.4). This means it has the following density function:


\(f(u) = \begin{cases} \frac{1}{360} & \text{for } 0 \le u \le 360, \\ 0 & \text{otherwise.} \end{cases}\)


Thus for any \(\theta_1 < \theta_2\) we have


(4.5.8)  \(P(\theta_1 \le \theta \le \theta_2) = \int_{\theta_1}^{\theta_2} \frac{1}{360}\, du = \frac{\theta_2 - \theta_1}{360}.\)


This formula says that the probability of the needle pointing between any
two directions is proportional to the angle between them. If the angle \(\theta_2 - \theta_1\)
shrinks to zero, then so does the probability. Hence in the limit the probability
of the needle pointing exactly at \(\theta\) is equal to zero. From an empirical point
of view, this event does not really make sense because the needle itself must
have a width. So in the end it is the mathematical fiction or idealization of a
“line without width” that is the root of the paradox.

There is a deeper way of looking at this situation which is very rich. It 
should be clear that instead of spinning a needle we may just as well “pick a 
number at random” from the interval [0, 1]. This can be done by bending 
the circle into a line segment and changing the unit. Now every point in [0, 1] 
can be represented by a decimal such as 


(4.5.9)  .141592653589793 ⋯.


There is no real difference if the decimal terminates because then we just 
have all digits equal to 0 from a certain place on, and 0 is no different from 
any other digit. Thus, to pick a number in [0, 1] amounts to picking all its 
decimal digits one after another. That is the kind of thing a computing ma- 
chine churns out. Now the chance of picking any prescribed digit, say the 
first digit “‘1’’ above, is equal to 1/10 and the successive pickings form totally 
independent trials (see §2.4). Hence the chance of picking the 15 digits shown
in (4.5.9) is equal to


\(\underbrace{\frac{1}{10} \cdot \frac{1}{10} \cdots \frac{1}{10}}_{15 \text{ times}} = \left(\frac{1}{10}\right)^{15}.\)


If we remember that \(10^9\) is a billion this probability is already so small that
according to Émile Borel [1871-1956; great French mathematician and one
of the founders of modern probability theory], it is terrestrially negligible and
should be equated to zero! But we have only gone 15 digits in the decimals
of the number \(\pi - 3\), so there can be no question whatsoever of picking this
number itself and yet if you can imagine going on forever, you will end up
with some number which is just as impossible a priori as this \(\pi - 3\). So here
again we are up against a mathematical fiction: the real number system.

We may generalize this example as follows. Let [a, b] be any finite,
nondegenerate interval in \(R^1\) and put


\(f(u) = \begin{cases} \frac{1}{b - a} & \text{for } a \le u \le b, \\ 0 & \text{otherwise.} \end{cases}\)


This is a density function and the corresponding distribution is called the uniform
distribution on [a, b]. We can write the latter explicitly:


\(F(x) = \frac{[(a \vee x) \wedge b] - a}{b - a}\)


if you have a taste for such tricky formulas. 
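
One way to read the distribution function of the uniform distribution on [a, b] is: clamp x to the interval [a, b], then measure how far it lies from a as a fraction of the length b − a. The Python sketch below takes that reading, with max and min doing the clamping; the function name `uniform_cdf` is hypothetical.

```python
def uniform_cdf(x, a, b):
    """F(x) for the uniform distribution on [a, b]: clamp x, then rescale."""
    if not a < b:
        raise ValueError("need a nondegenerate interval with a < b")
    return (min(max(a, x), b) - a) / (b - a)


# The needle of Example 10: uniform on [0, 360].
for x in (-30, 0, 90, 360, 400):
    print(x, uniform_cdf(x, 0, 360))
```

Below a the function is 0, above b it is 1, and in between it rises linearly, so that differences of F reproduce the proportionality to angle found in Example 10.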


Example 11. A chord is drawn at random in a circle. What is the probability 
that its length exceeds that of a side of an inscribed equilateral triangle? 
Let us draw such a triangle in a circle with center O and radius R, and 
make the following observations. The side is at distance R/2 from O; its midpoint
is on a concentric circle of radius R/2; it subtends an angle of 120
degrees at O. You ought to know how to compute the length of the side, but
this will not be needed. Let us denote by A the desired event that a random 
chord be longer than that side. Now the length of any chord is determined 


Figure 22 
by any one of the three quantities: its distance d from O; the location of its
midpoint M; the angle $\theta$ it subtends at O. We are going to assume in turn that
each of these has a uniform distribution over its range and compute the 
probability of A under each assumption. 


(1) Suppose that d is uniformly distributed in [0, R]. This is a plausible
assumption if we move a ruler parallel to itself with constant speed from a 
tangential position towards the center, stopping somewhere to intersect the 
circle in a chord. It is geometrically obvious that the event A will occur if and 
only if d < R/2. Hence P(A) = 1/2. 


(2) Suppose that M is uniformly distributed over the disk D formed by 
the given circle. This is a plausible assumption if a tiny dart is thrown at D 
and a chord is then drawn perpendicular to the line joining the hitting point 
to O. Let D′ denote the concentric disk of radius R/2. Then the event A will
occur if and only if M falls within D′. Hence P(A) = P(M ∈ D′) = (area of
D′)/(area of D) = 1/4.


(3) Suppose that $\theta$ is uniformly distributed between zero and 360 degrees.
This is plausible if one endpoint of the chord is arbitrarily fixed and the other
is obtained by rotating a radius at constant speed to stop somewhere on the
circle. Then it is clear from the picture that A will occur if and only if $\theta$ is
between 120 and 240 degrees. Hence P(A) = (240 − 120)/360 = 1/3.


Thus the answer to the problem is 1/2, 1/4 or 1/3 according to the differ- 
ent hypotheses made. It follows that these hypotheses are not compatible 
with one another. Other hypotheses are possible and may lead to still other 
answers. Can you think of a good one? This problem was known as Bertrand’s 
paradox in the earlier days of discussions of probability theory. But of course 
the paradox is due only to the fact that the problem is not well-posed without 
specifying the underlying nature of the randomness. It is not surprising that 
the different ways of randomization should yield different probabilities, which 
can be verified experimentally by the mechanical procedures described. Here 
is a facile analogy. Suppose that you are asked how long it takes to go from 
your dormitory to the classroom without specifying whether we are talking 
about “walking,” “biking,” or “driving” time. Would you call it paradoxical 
that there are different answers to the question? 
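
The mechanical procedures can be imitated on a computer. The following Python sketch is an assumption-laden illustration, not a prescribed method: the function name, the number of trials and the seed are our own choices. It draws a chord under each of the three hypotheses and records how often it is longer than the side $R\sqrt{3}$ of the inscribed triangle.

```python
import math
import random


def bertrand_estimates(n_trials=100_000, radius=1.0, seed=0):
    """Estimate P(A) under the three hypotheses of Example 11."""
    rng = random.Random(seed)
    side = radius * math.sqrt(3.0)  # side of the inscribed equilateral triangle
    hits = {"distance": 0, "midpoint": 0, "angle": 0}
    for _ in range(n_trials):
        # (1) distance d uniform in [0, R]; chord length is 2*sqrt(R^2 - d^2).
        d = rng.uniform(0.0, radius)
        if 2.0 * math.sqrt(max(0.0, radius**2 - d * d)) > side:
            hits["distance"] += 1

        # (2) midpoint M uniform over the disk, by rejection from the square.
        while True:
            x = rng.uniform(-radius, radius)
            y = rng.uniform(-radius, radius)
            if x * x + y * y <= radius**2:
                break
        if 2.0 * math.sqrt(radius**2 - (x * x + y * y)) > side:
            hits["midpoint"] += 1

        # (3) subtended angle uniform in [0, 2*pi); chord length is 2R*sin(theta/2).
        theta = rng.uniform(0.0, 2.0 * math.pi)
        if 2.0 * radius * math.sin(theta / 2.0) > side:
            hits["angle"] += 1
    return {name: count / n_trials for name, count in hits.items()}


if __name__ == "__main__":
    print(bertrand_estimates())
```

With a large number of trials the three frequencies settle near 1/2, 1/4 and 1/3 respectively, one answer for each way of being "random".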

We end this section with some other simple examples of random variables 
with densities. Another important case, the normal distribution, will be dis- 
cussed in Chapter 6. 


Example 12. Suppose you station yourself at a spot on a relatively serene 
country road and watch the cars that pass by that spot. With your stopwatch 
you can clock the time before the first car passes. This is a random variable T 
called the waiting time. Under certain circumstances it is a reasonable hy-
pothesis that T has the density function below with a certain $\lambda > 0$:

(4.5.10)    $f(u) = \lambda e^{-\lambda u}, \quad u \ge 0.$

It goes without saying that $f(u) = 0$ for $u < 0$. The corresponding distribu-
tion function is called the exponential distribution with parameter $\lambda$, obtained
by integrating f as in (4.5.4):

$F(x) = \int_{-\infty}^{x} f(u)\,du = \int_0^x \lambda e^{-\lambda u}\,du = 1 - e^{-\lambda x}.$


In particular if we put $x = +\infty$, or better, let $x \to \infty$ in the above, we see that
f satisfies the conditions in (4.5.1), so it is indeed a density function. We have

(4.5.11)    $P(T \le x) = F(x) = 1 - e^{-\lambda x};$

but in this case it is often more convenient to use the tail probability:

(4.5.12)    $P(T > x) = 1 - F(x) = e^{-\lambda x}.$


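This can be obtained directly from (4.5.3) with $A = (x, \infty)$, thus:

$P(T \in (x, \infty)) = \int_{(x, \infty)} \lambda e^{-\lambda u}\,du = \int_x^{\infty} \lambda e^{-\lambda u}\,du = e^{-\lambda x}.$

For every given x, say 5 (seconds), the probability $e^{-\lambda x}$ in (4.5.12) decreases
as $\lambda$ increases. This means your waiting time tends to be shorter if $\lambda$ is larger.
On a busy highway $\lambda$ will be large indeed. The expected waiting time is given
by

(4.5.13)    $E(T) = \int_0^{\infty} u \lambda e^{-\lambda u}\,du = \frac{1}{\lambda} \int_0^{\infty} t e^{-t}\,dt = \frac{1}{\lambda}.$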


[Can you compute the integral above using “integration by parts” without
recourse to a table?] This result supports our preceding observation that T
tends on the average to be smaller when $\lambda$ is larger.

The exponential distribution is a very useful model for various types of 
waiting time problems such as telephone calls, service times, splitting of 
radioactive particles, etc.; see §7.2. 
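
Formulas (4.5.12) and (4.5.13) can be watched at work in a simulation. The Python sketch below is a rough illustration under our own assumptions: the value $\lambda = 2$, the sample size and the seed are arbitrary, and the function names are ours. It relies on the standard library call `random.expovariate`, which draws from the exponential distribution with the given rate.

```python
import math
import random


def simulate_waiting_times(lam, n, seed=0):
    """Draw n waiting times from the exponential distribution with parameter lam."""
    rng = random.Random(seed)
    return [rng.expovariate(lam) for _ in range(n)]


def tail_probability(lam, x):
    """P(T > x) = exp(-lam * x), as in (4.5.12)."""
    return math.exp(-lam * x)


if __name__ == "__main__":
    lam, x = 2.0, 0.5
    times = simulate_waiting_times(lam, 100_000)
    sample_mean = sum(times) / len(times)
    sample_tail = sum(t > x for t in times) / len(times)
    print("mean:", sample_mean, "theory:", 1.0 / lam)
    print("tail:", sample_tail, "theory:", tail_probability(lam, x))
```

The sample mean should lie close to $1/\lambda$ and the observed frequency of {T > x} close to $e^{-\lambda x}$; doubling $\lambda$ halves the average wait.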


Example 13. Suppose in a problem involving the random variable T above, 
what we really want to measure is its logarithm (to the base e): 


(4.5.14) S = log T. 


This is also a random variable (cf. Proposition 2 in §5.2); it is negative if 
T < 1, zero if T = 1 and positive if T > 1. What are its probabilities? We 
may be interested in $P(a < S < b)$ but it is clear that we need only find
$P(S \le x)$, namely the distribution function $F_S$ of S. Now the function

$x \to \log x$

is monotone and its inverse is

$x \to e^x,$

so that

$S \le x \Leftrightarrow \log T \le x \Leftrightarrow T \le e^x.$

Hence by (4.5.11)

$F_S(x) = P\{S \le x\} = P\{T \le e^x\} = 1 - e^{-\lambda e^x}.$

The density function $f_S$ is obtained by differentiating:

$f_S(x) = F_S'(x) = \lambda e^x e^{-\lambda e^x} = \lambda e^{x - \lambda e^x}.$


This looks formidable but you see it is easily derived. 
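
If you distrust your own differentiation, a numerical check is cheap. The Python sketch below (function names and the value $\lambda = 1.5$ are our illustrative choices) compares the formula for $f_S$ with a central difference quotient of $F_S$.

```python
import math


def F_S(x, lam):
    """Distribution function of S = log T: 1 - exp(-lam * e^x)."""
    return 1.0 - math.exp(-lam * math.exp(x))


def f_S(x, lam):
    """Density of S obtained by differentiation: lam * exp(x - lam * e^x)."""
    return lam * math.exp(x - lam * math.exp(x))


def central_difference(F, x, lam, h=1e-5):
    """Numerical approximation to the derivative of F at x."""
    return (F(x + h, lam) - F(x - h, lam)) / (2.0 * h)


if __name__ == "__main__":
    lam = 1.5
    for x in (-2.0, -0.5, 0.0, 1.0):
        print(x, f_S(x, lam), central_difference(F_S, x, lam))
```

The two columns agree to many decimal places, which is what the derivation promises.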


Example 14. A certain river floods every year. Suppose the low water mark 
is set at 1, and the high water mark Y has the distribution function 


(4.5.15)    $F(y) = P(Y \le y) = 1 - \frac{1}{y^2}, \quad 1 \le y < \infty.$

Observe that $F(1) = 0$, that $F(y)$ increases with y and $F(y) \to 1$ as $y \to \infty$.
This is as it should be from the meaning of $P(Y \le y)$. To get the density
function we differentiate:

(4.5.16)    $f(y) = F'(y) = \frac{2}{y^3}, \quad 1 \le y < \infty.$

It is not necessary to check that $\int_{-\infty}^{\infty} f(y)\,dy = 1$, because this is equivalent
to $\lim_{y \to \infty} F(y) = 1$. The expected value of Y is given by

$E(Y) = \int_1^{\infty} u \cdot \frac{2}{u^3}\,du = \int_1^{\infty} \frac{2}{u^2}\,du = 2.$


Thus the maximum of Y is twice that of the minimum, on the average. 

What happens if we set the low water mark at 0 instead of 1, and use a 
unit of measuring the height which is 1/10 of that used above? This means we 
set 


(4.5.17)    $Z = 10(Y - 1).$

As in Example 13 we have

$Z \le z \Leftrightarrow 10(Y - 1) \le z \Leftrightarrow Y \le 1 + \frac{z}{10}, \quad 0 \le z < \infty.$

From this we can compute:

$F_Z(z) = 1 - \frac{100}{(10 + z)^2}, \qquad f_Z(z) = \frac{200}{(10 + z)^3}.$


The calculation of E(Z) from fz is tedious but easy. The answer is E(Z) = 10 
and comparing with E(Y) = 2 we see that 


(4.5.18) E(Z) = 10(E(Y) — 1). 


Thus the means of Y and Z are connected by the same linear relation as the 
random variables themselves. Does this seem obvious to you? The general 
proposition will be discussed in §6.1. 
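
Here is a small experiment on Example 14, offered as a sketch under our own assumptions (sample size, seed and function names are not from the text). Values of Y are produced by solving $u = 1 - 1/y^2$ for y with u uniform in [0, 1), then Z is computed from (4.5.17). The observed frequency of {Z ≤ 5} is compared with $F_Z(5)$, and the sample averages are seen to obey the same linear relation as the variables.

```python
import math
import random


def sample_Y(rng):
    """Draw Y with F(y) = 1 - 1/y^2 by inverting the distribution function."""
    u = rng.random()  # uniform in [0, 1), so 1 - u is in (0, 1]
    return 1.0 / math.sqrt(1.0 - u)


def F_Z(z):
    """Distribution function of Z = 10(Y - 1) for z >= 0."""
    return 1.0 - 100.0 / (10.0 + z) ** 2


rng = random.Random(1)
ys = [sample_Y(rng) for _ in range(100_000)]
zs = [10.0 * (y - 1.0) for y in ys]

mean_y = sum(ys) / len(ys)
mean_z = sum(zs) / len(zs)
empirical = sum(z <= 5.0 for z in zs) / len(zs)

if __name__ == "__main__":
    print("frequency of {Z <= 5}:", empirical, "F_Z(5):", F_Z(5.0))
    print("mean of Z:", mean_z, "10 * (mean of Y - 1):", 10.0 * (mean_y - 1.0))
```

The relation between the two averages holds up to rounding for any sample whatever, which hints at why (4.5.18) is true. The averages themselves approach 2 and 10 only slowly, because $E(Y^2)$ is infinite for this distribution.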


4.6. General case 


The most general random variable is a function X defined on the sample
space $\Omega$ such that for any real x, the probability $P(X \le x)$ is defined.

To be frank, this statement has put the cart before the horse. What comes
first is a probability measure P defined on a class of subsets of $\Omega$. This class
is called the sample Borel field or probability field and is denoted by $\mathcal{F}$. Now
if a function X has the property that for every x, the set $\{\omega \mid X(\omega) \le x\}$ belongs
to the class $\mathcal{F}$, then it is called a random variable. [We must refer to Appendix
1 for a full description of this concept; but the rest of this section should be
intelligible without the formalities.] In other words, an arbitrary function
must pass a test to become a member of the club. The new idea here is that
P is defined only for subsets in $\mathcal{F}$, not necessarily for all subsets of $\Omega$. If
it happens to be defined for all subsets, then of course the test described
above becomes a nominal one and every function is automatically a random
variable. This is the situation for a countable space $\Omega$ discussed in §4.1. In
general, as we have hinted several times before, it is impossible to define a
probability measure on all subsets of $\Omega$, and so we must settle for a certain
class $\mathcal{F}$. Since only sets in $\mathcal{F}$ have probabilities assigned to them, and since we
wish to discuss sample sets of the sort “$X \le x$,” we are obliged to require that
these belong to $\mathcal{F}$. Thus the necessity of such a test is easy to understand.
What may be a little surprising is that this test is all we need. Namely, once
we have made this requirement, we can then go on to discuss the probabilities
of a whole variety of sample sets such as $\{a < X < b\}$, $\{X = x\}$, {X takes
a rational value}, or some crazy thing like $\{e^X > X^2 + 1\}$.

Next, we define for every real x: 


(4.6.1)    $F(x) = P(X \le x)$

or equivalently for $a < b$:

$F(b) - F(a) = P(a < X \le b);$

and call the function F the distribution function of X. This has been done in
previous cases but we no longer have the special representations in (4.3.8) or
(4.5.4):

$F(x) = \sum_{v_n \le x} p_n, \qquad F(x) = \int_{-\infty}^{x} f(u)\,du$


in terms of elementary probability or a density function. As a matter of fact, 
the general F turns out to be a mixture of these two kinds together with a 
more weird kind (the singular type). But we can operate quite well with the F 
as defined by (4.6.1) without further specification. The mathematical equip-
ment required to handle the general case, however, is somewhat more ad-
vanced (at the level of a course like “Fundamental concepts of analysis”).
So we cannot go into this but will just mention two easy facts about F. 


(i) F is monotone nondecreasing: namely $x \le x' \Rightarrow F(x) \le F(x')$;
(ii) F has limits 0 and 1 at $-\infty$ and $+\infty$ respectively:

$F(-\infty) = \lim_{x \to -\infty} F(x) = 0, \qquad F(+\infty) = \lim_{x \to +\infty} F(x) = 1.$

Property (i) holds because if $x \le x'$, then $\{X \le x\} \subset \{X \le x'\}$. Property
(ii) is intuitively obvious because the event $\{X \le x\}$ becomes impossible as
$x \to -\infty$, and certain as $x \to +\infty$. This argument may satisfy you but the
rigorous proofs are a bit more sophisticated and depend on the countable 
additivity of P (see §2.3). Let us note that the existence of the limits in (ii) 
follows from the monotonicity in (i) and a fundamental theorem in calculus: 
a bounded monotone sequence of real numbers has a limit. 
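
To see properties (i) and (ii) and the relation F(b) − F(a) = P(a < X ≤ b) in a concrete case, take X to be the point shown by one throw of a perfect die. The Python sketch below is only an illustration; the function name and the choice of example are ours.

```python
import math
from fractions import Fraction


def die_cdf(x):
    """F(x) = P(X <= x) when X is the point shown by a perfect die."""
    # The number of faces not exceeding x, kept between 0 and 6.
    k = min(max(math.floor(x), 0), 6)
    return Fraction(k, 6)


if __name__ == "__main__":
    grid = [n / 4 for n in range(-8, 40)]
    values = [die_cdf(x) for x in grid]
    print("nondecreasing:", all(u <= v for u, v in zip(values, values[1:])))
    print("far left:", die_cdf(-100), "far right:", die_cdf(100))
    print("P(2 < X <= 5) =", die_cdf(5) - die_cdf(2))
```

The function is flat between the integers and jumps by 1/6 at each of 1, ..., 6, so it is neither given by a density nor continuous, yet (i) and (ii) hold all the same.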

The rest of the section is devoted to a brief discussion of some basic 
notions concerning random vectors. This material may be postponed until it 
is needed in Chapter 6. 

For simplicity of notation we will consider only two random variables 
X and Y, but the extension to any finite number is straightforward. We first 
consider the case where X and Y are countably-valued. Let X take the values
$\{x_i\}$, Y take the values $\{y_j\}$, and put

(4.6.2)    $P(X = x_i, Y = y_j) = p(x_i, y_j).$

When $x_i$ and $y_j$ range over all possible values, the set of “elementary proba-
bilities” above gives the joint probability distribution of the random vector
(X, Y). To get the probability distribution of X alone, we let $y_j$ range over all
possible values in (4.6.2), thus:

(4.6.3)    $P(X = x_i) = \sum_{y_j} p(x_i, y_j) = p(x_i, *)$

where the last quantity is defined by the middle sum. When $x_i$ ranges over all
possible values, the set of $p(x_i, *)$ gives the marginal distribution of X. The
marginal distribution of Y is similarly defined. Let us observe that these
marginal distributions do not in general determine the joint distribution. 
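
A tiny example makes the last remark concrete. In the Python sketch below (a hypothetical illustration with names of our own choosing) a joint distribution is stored as a dictionary from pairs (x, y) to probabilities, and the marginal distributions are obtained by summing as in (4.6.3). Two different joint distributions are shown to have identical marginals.

```python
from collections import defaultdict
from fractions import Fraction


def marginals(joint):
    """Return the marginal distributions of X and Y from p(x, y)."""
    p_x = defaultdict(Fraction)
    p_y = defaultdict(Fraction)
    for (x, y), p in joint.items():
        p_x[x] += p  # sum over y for fixed x, as in (4.6.3)
        p_y[y] += p  # sum over x for fixed y
    return dict(p_x), dict(p_y)


quarter, half = Fraction(1, 4), Fraction(1, 2)

# Two coins tossed separately: all four pairs have probability 1/4.
separate = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
# One coin read twice: X and Y always coincide.
identical = {(0, 0): half, (1, 1): half}

if __name__ == "__main__":
    print(marginals(separate))
    print(marginals(identical))
```

In both cases X and Y each take the values 0 and 1 with probability 1/2, but the joint distributions are plainly different.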

Just as we can express the expectation of any function of X by means of 
its probability distribution (see (4.3.16)), we can do the same for any function 
of (X, Y) as follows: 


(4.6.4)    $E(\varphi(X, Y)) = \sum_{x_i} \sum_{y_j} \varphi(x_i, y_j)\, p(x_i, y_j).$

It is instructive to see that this results from a rearrangement of terms in the
definition of the expectation of $\varphi(X, Y)$ as one random variable as in (4.3.11):

$E(\varphi(X, Y)) = \sum_{\omega} \varphi(X(\omega), Y(\omega))\, P(\omega).$
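
The rearrangement can be carried out explicitly on a small sample space. In the Python sketch below, which is an illustration under our own choices, $\Omega$ is the set of 36 outcomes of two throws of a perfect die, X is the larger and Y the smaller of the two points, and $\varphi(x, y) = x - y$. The expectation is computed once by summing over $\omega$ and once by grouping the terms according to the values of (X, Y).

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes
P = {w: Fraction(1, 36) for w in omega}


def X(w):
    return max(w)


def Y(w):
    return min(w)


def phi(x, y):
    return x - y


# Sum over sample points, as in (4.3.11).
direct = sum(phi(X(w), Y(w)) * P[w] for w in omega)

# Group the sample points by the value of (X, Y) to get p(x, y), then use (4.6.4).
joint = defaultdict(Fraction)
for w in omega:
    joint[(X(w), Y(w))] += P[w]
via_joint = sum(phi(x, y) * p for (x, y), p in joint.items())

if __name__ == "__main__":
    print(direct, via_joint)
```

Both sums contain the same terms in a different order, so they must agree.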


Next, we consider the density case extending the situation in §4.5. The 
random vector (X, Y) is said to have a joint density function f in case 


(4.6.5)    $P(X \le x, Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du$

for all (x, y). It then follows that for any “reasonable” subset S of the Car-
tesian plane (called a Borel set) we have

(4.6.6)    $P((X, Y) \in S) = \iint_S f(u, v)\,du\,dv.$


For example S may be polygons, disks, ellipses and unions of such shapes. 
Note that (4.6.6) contains (4.6.5) as a very particular case and we can, at a 
pinch, accept the more comprehensive condition (4.6.6) as the definition of f 
as density for (X, Y). However, here is a heuristic argument from (4.6.5) to 
(4.6.6). Let us denote by R(x, y) the infinite rectangle in the plane with sides 
parallel to the coordinate axes and lying to the southwest of the point (x, y). 
The picture below shows that for any $\delta > 0$ and $\delta' > 0$:

$R(x + \delta, y + \delta') - R(x + \delta, y) - R(x, y + \delta') + R(x, y)$

is the shaded rectangle

[Figure: the shaded rectangle with corners $(x, y)$, $(x + \delta, y)$, $(x, y + \delta')$ and $(x + \delta, y + \delta')$.]

It follows that if we manipulate the relation (4.6.5) in the same way, we get

$P(x \le X \le x + \delta,\ y \le Y \le y + \delta') = \int_x^{x+\delta} \int_y^{y+\delta'} f(u, v)\,dv\,du.$

This means (4.6.6) is true for the shaded rectangle. By varying x, y as well as
$\delta$, $\delta'$ we see that the formula is true for any rectangle of this shape. Now any
reasonable figure can be approximated from inside and outside by a number 
of such small rectangles (even just squares)—a fact known already to the 
ancient Greeks. Hence in the limit we can get (4.6.6) as asserted. 
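
The bookkeeping with the four infinite rectangles is easy to test numerically. The Python sketch below uses a hypothetical joint distribution function, $F(x, y) = (1 - e^{-x})(1 - e^{-y})$ for positive x and y, chosen by us only because the integral of its density $e^{-u-v}$ over a rectangle can be written down at once. The names are ours.

```python
import math


def joint_cdf(x, y):
    """Hypothetical F(x, y) = (1 - e^{-x})(1 - e^{-y}) for x, y > 0, else 0."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    return (1.0 - math.exp(-x)) * (1.0 - math.exp(-y))


def rectangle_probability(F, x, y, dx, dy):
    """Probability of the rectangle with corners (x, y) and (x + dx, y + dy),
    obtained by adding and subtracting the four infinite rectangles."""
    return F(x + dx, y + dy) - F(x + dx, y) - F(x, y + dy) + F(x, y)


if __name__ == "__main__":
    x, y, dx, dy = 0.5, 1.0, 0.25, 0.5
    by_corners = rectangle_probability(joint_cdf, x, y, dx, dy)
    # Integral of e^{-u-v} over the same rectangle, computed directly.
    by_integral = (math.exp(-x) - math.exp(-(x + dx))) * (
        math.exp(-y) - math.exp(-(y + dy))
    )
    print(by_corners, by_integral)
```

The two numbers coincide up to rounding, as the heuristic argument says they should.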

The curious reader may wonder why a similar argument was not given 
earlier for the case of one random variable in (4.5.3)? The answer is: heuristi-
cally speaking, there are hardly any sets in $R^1$ other than intervals, points and
their unions! Things are pretty tight in one dimension and our geometric
intuition does not work well. This is one reason why classical measure theory
is a sophisticated business.

The joint density function f satisfies the following conditions: 


(i) $f(u, v) \ge 0$ for all $(u, v)$;
(ii) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(u, v)\,du\,dv = 1.$


Of course (ii) implies that f is integrable over the whole plane. Frequently we 
assume also that f is continuous. Now the formulas analogous to (4.6.3) are 


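(4.6.7)    $P(X \le x) = \int_{-\infty}^{x} f(u, *)\,du, \quad \text{where } f(u, *) = \int_{-\infty}^{\infty} f(u, v)\,dv;$
           $P(Y \le y) = \int_{-\infty}^{y} f(*, v)\,dv, \quad \text{where } f(*, v) = \int_{-\infty}^{\infty} f(u, v)\,du.$

The functions $u \to f(u, *)$ and $v \to f(*, v)$ are called respectively the marginal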
density functions of X and Y. They are derived from the joint density function 
after “integrating out” the variable which is not in question. 
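
“Integrating out” can be done numerically when no formula is at hand. The Python sketch below works with a hypothetical joint density of our own choosing, $f(u, v) = u + v$ on the unit square and zero elsewhere, and approximates the marginal density of X by the midpoint rule. For this particular f the exact marginal is $u + 1/2$ on [0, 1], which serves as a check.

```python
def joint_density(u, v):
    """Hypothetical joint density: f(u, v) = u + v on the unit square, else 0."""
    if 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0:
        return u + v
    return 0.0


def marginal_density_x(u, n=1000):
    """Approximate f(u, *) by integrating out v over [0, 1] with the midpoint rule."""
    h = 1.0 / n
    return h * sum(joint_density(u, (k + 0.5) * h) for k in range(n))


def total_mass(n=200):
    """Approximate the double integral of f over the unit square."""
    h = 1.0 / n
    return h * sum(marginal_density_x((k + 0.5) * h, n) for k in range(n))


if __name__ == "__main__":
    print(marginal_density_x(0.3), "exact:", 0.3 + 0.5)
    print(total_mass(), "exact:", 1.0)
```

The second check confirms condition (ii) for the chosen f, so that it is indeed a joint density.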

The formula corresponding to (4.6.4) becomes in the density case: for 
any “reasonable” [Borel] function $\varphi$:

(4.6.8)    $E(\varphi(X, Y)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \varphi(u, v) f(u, v)\,du\,dv.$


The class of reasonable functions includes all bounded continuous functions 
in (u,v), indicators of reasonable sets, and functions which are continuous 
except across some smooth boundaries, for which the integral above exists, 
etc. 


In the most general case the joint distribution function F of (X, Y) is defined
by

(4.6.9)    $F(x, y) = P(X \le x, Y \le y)$ for all $(x, y)$.

If we denote $\lim_{y \to \infty} F(x, y)$ by $F(x, \infty)$, we have

$F(x, \infty) = P(X \le x, Y < \infty) = P(X \le x)$

since “$Y < \infty$” puts no restriction on Y. Thus $x \to F(x, \infty)$ is the marginal
distribution function of X. The marginal distribution function of Y is similarly
defined. 

Although these general concepts form the background whenever several 
random variables are discussed, explicit use of them will be rare in this book. 


Exercises 


1. If X is a random variable [on a countable sample space], is it true that
   $X + X = 2X$, $X - X = 0$?
   Explain in detail.

2. Let $\Omega = \{\omega_1, \omega_2, \omega_3\}$, $P(\omega_1) = P(\omega_2) = P(\omega_3) = 1/3$, and define X, Y and Z
   as follows:
   $X(\omega_1) = 1, \quad X(\omega_2) = 2, \quad X(\omega_3) = 3;$
   $Y(\omega_1) = 2, \quad Y(\omega_2) = 3, \quad Y(\omega_3) = 1;$
   $Z(\omega_1) = 3, \quad Z(\omega_2) = 1, \quad Z(\omega_3) = 2.$
   Show that these three random variables have the same probability dis-
   tribution. Find the probability distributions of X + Y, Y + Z and
   Z + X.

3. In No. 2 find the probability distribution of

   $X + Y - Z, \quad \sqrt{(X^2 + Y^2)Z}, \quad \dfrac{Z}{|X - Y|}.$

4. Take $\Omega$ to be a set of 5 real numbers. Define a probability measure and a
   random variable X on it which takes the values 1, 2, 3, 4, 5 with probabil-
   ities 1/10, 1/10, 1/5, 1/5, 2/5 respectively; another random variable Y which takes
   the values $\sqrt{2}$, $\sqrt{3}$, $\pi$ with probabilities 1/5, 3/10, 1/2. Find the probability dis-
   tribution of XY. [Hint: the answer depends on your choice and is not
   unique.]

5. Generalize No. 4 by constructing $\Omega$, P, X so that X takes the values
   $v_1, v_2, \ldots, v_n$ with probabilities $p_1, p_2, \ldots, p_n$ where the $p_n$'s satisfy
   (4.3.10).

6. In Example 3 of §4.1, what do the following sets mean?

   $\{X + Y = 7\}, \quad \{X + Y \le 7\}, \quad \{X \vee Y > 4\}, \quad \{X \ne Y\}.$

   List all the $\omega$'s in each set.
7.* Let X be integer valued and let F be its distribution function. Show that
    for every x and $a < b$:

    $P(X = x) = \lim_{\epsilon \downarrow 0} [F(x + \epsilon) - F(x - \epsilon)],$
    $P(a < X < b) = \lim_{\epsilon \downarrow 0} [F(b - \epsilon) - F(a + \epsilon)].$

    [The results are true for any random variable, but require more ad-
    vanced proofs even when $\Omega$ is countable.]

8. In Example 4 of §4.2, suppose that

    $X = 5000 + X'$

    where X′ is uniformly distributed over the set of integers (dollars) from
    1 to 5000. What does this hypothesis mean? Find the probability dis-
    tribution and mean of Y under this hypothesis.

9. As in No. 8 but now suppose that

    $X = 4000 + X'$

    where X′ is uniformly distributed from 1 to 10000.

10.* As in No. 8 but now suppose that

    $X = 3000 + X'$

    and X′ has the exponential distribution with mean 7000. Find E(Y).

11. Let $\lambda > 0$ and define f as follows:

    $f(u) = \begin{cases} \frac{1}{2} \lambda e^{-\lambda u}, & \text{if } u \ge 0, \\ \frac{1}{2} \lambda e^{+\lambda u}, & \text{if } u < 0. \end{cases}$

    This f is called bilateral exponential. If X has density f, find the density
    of |X|. [Hint: begin with the distribution function.]

12. If X is a positive random variable with density f, find the density of
    $+\sqrt{X}$. Apply this to the distribution of the side-length of a square when
    its area is uniformly distributed in [a, b].

13. If X has density f find the density of $X^2$. [Hint: $P(X^2 < x) = F_X(\sqrt{x}) - F_X(-\sqrt{x})$.]

14. Prove (4.4.5) in two ways: (a) by multiplying out $(1 - x)(1 + x + \cdots + x^n)$,
    (b) by using Taylor's series.

15. Suppose that

    $p_n = c q^{n-1} p, \quad 1 \le n \le m;$

    where c is a constant and m is a positive integer, cf. (4.4.8). Determine c
    so that $\sum_{n=1}^{m} p_n = 1$. (This scheme corresponds to the waiting time for a
    success when the number of trials is limited in advance.)
16. A perfect coin is tossed n times. Let $Y_n$ denote the number of heads
    obtained minus the number of tails. Find the probability distribution of
    $Y_n$, and its mean. [Hint: there is a simple relation between $Y_n$ and the $S_n$
    in Example 9 of §4.4.]

17. Refer to Problem 1 in §3.4. Suppose there are 11 rotten apples in a
    bushel of 550, and 25 apples are picked at random. Find the probability
    distribution of the number X of rotten apples among those picked.

18.* Generalize No. 17 to arbitrary numbers and find the mean of X. [Hint:
    this requires some expertise in combinatorics but becomes trivial after
    §6.1.]

19. Let

    $P(X = n) = p_n = \dfrac{1}{n(n + 1)}, \quad n \ge 1.$

    Is this a probability distribution for X? Find $P(X \ge m)$ for any m and
    E(X).

20. If all the books in a library have been upset and a monkey is hired to
    put them all back on the shelves, it can be shown that a good approxi-
    mation for the probability of having exactly n books put back in their
    original places is

    $\dfrac{e^{-1}}{n!}, \quad n \ge 0.$

    Find the expected number of books returned to their original places.
    [This oft-quoted illustration is a variant on the matching problem dis-
    cussed in Problem 6 of §3.4.]

21. Find an example in which the series $\sum_n p_n v_n$ in (4.3.11) converges but not
    absolutely. [Hint: there is really nothing hard about this: choose $p_n = 1/2^n$
    say, and now choose $v_n$ so that $p_n v_n$ is the general term of any non-
    absolutely convergent series you know.]

22. If f and g are two density functions, show that $\lambda f + \mu g$ is also a density
    function, where $\lambda + \mu = 1$, $\lambda \ge 0$, $\mu \ge 0$.

23. Find the probability that a random chord drawn in a circle is longer
    than the radius. As in Example 11 of §4.5 work this out under the three
    different hypotheses discussed there.

24. Let

    $f(u) = u e^{-u}, \quad u \ge 0.$

    Show that f is a density function. Find $\int_0^{\infty} u f(u)\,du$.

25. In the figure below an equilateral triangle, a trapezoid and a semi-disk
    are shown:

    [Figure: an equilateral triangle, a trapezoid and a semi-disk.]

    Determine numerical constants for the sides and radius to make these
    the graphs of density functions.

26. Suppose a target is a disk of radius 10 feet and suppose that the proba-
    bility of hitting within any concentric disk is proportional to the area of
    the disk. Let R denote the distance of the bullet from the center. Find
    the distribution function, density function and mean of R.

27. Agent 009 was trapped between two narrow abysmal walls. He swung
    his gun around in a vertical circle touching the walls as shown in Fig. 23,
    and fired a wild [random] shot. Assume that the angle which his pistol
    makes with the horizontal is uniformly distributed between 0° and 90°.
    Find the distribution of the height where the bullet landed and its mean.

    [Figure 23: Agent 009 between the two walls; labels: Wall, Wall, Height.]

28. [St. Petersburg Paradox] You play a game with your pal by tossing a
    perfect coin repeatedly and betting on the waiting time X until a head is
    tossed up. You agree to pay him $2^X$¢ when the value of X is known,
    namely $2^n$¢ if X = n. If you figure that a fair price for him to pay you
    in advance in order to win this random prize should be equal to the
    mathematical expectation $E(2^X)$, how much should he pay? How much
    honestly would you accept to play this game? [If you do not see any
    paradox in this, then you do not agree with such illustrious mathe-
    maticians as Daniel Bernoulli, D'Alembert, Poisson, Borel, to name only
    a few. For a brief account see [Keynes]. Feller believed that the paradox
    would go away if more advanced mathematics were used to reformulate
    the problem. You will have to decide for yourself whether it is not more
    interesting as a philosophical and psychological challenge.]

29. One objection to the scheme in No. 28 is that “time must have a stop.”
    So suppose that only m tosses at most are allowed and your pal gets
    nothing if head does not show up in m tosses. Try m = 10 and m = 100.
    What is now a fair price for him to pay? And do you feel more com-
    fortable after this change of rule? In this case Feller's explanation melts
    away but the psychological element remains.

30.* A number $\mu$ is called the median of the random variable X iff $P(X \ge \mu) \ge 1/2$
    and $P(X \le \mu) \ge 1/2$. Show that such a number always exists but
    need not be unique. Here is a practical example. After n examination
    papers have been graded, they are arranged in descending order. There
    is one in the middle if n is odd, two if n is even, corresponding to the
    median(s). Explain the probability model used.

31. An urn contains n tickets numbered from 1 to n. Two tickets are drawn
    (without replacement). Let X denote the smaller, Y the larger of the
    two numbers so obtained. Describe the joint distribution of (X, Y),
    and the marginal ones. Find the distribution of Y − X from the joint
    distribution.

32. Pick two numbers at random from [0, 1]. Define X and Y as in No. 31
    and answer the same questions. [Hint: draw the picture and compute
    areas.]


Appendix I 


Borel Fields and General Random Variables 


When the sample space $\Omega$ is uncountable it may not be possible to define a
probability measure for all its subsets, as we did for a countable $\Omega$ in §2.4.
We must restrict the measure to sets of a certain family which must however
be comprehensive enough to allow the usual operations with sets. Specifically,
we require the family $\mathcal{F}$ to have two properties:

(a) if a set A belongs to $\mathcal{F}$, then its complement $A^c = \Omega - A$ also belongs to $\mathcal{F}$;
(b) if a countable number of sets $A_1, A_2, \ldots$ all belong to $\mathcal{F}$, then their union
    $\bigcup_n A_n$ also belongs to $\mathcal{F}$.

It follows from De Morgan's laws that the union in (b) may be replaced by
the intersection $\bigcap_n A_n$ as well. Thus if we operate on the members of the
family with the three basic operations mentioned above, for a countable num-
ber of times, in any manner or order (see e.g. (1.3.3)), the result is still a
member of the family. In this sense the family is said to be closed under these
operations, and so also under other derived operations such as differences.
Such a family of subsets of $\Omega$ is called a Borel field on $\Omega$. In general there are
many such fields, for example the family of all subsets which is certainly a
Borel field but may be too large to have a probability defined on it; or the
family of 2 sets $\{\emptyset, \Omega\}$, or four sets $\{\emptyset, A, A^c, \Omega\}$ with a fixed set A, which
are too small for most purposes. Now suppose that a reasonable Borel field $\mathcal{F}$
has been chosen and a probability measure P has been defined on it, then we
have a probability triple $(\Omega, \mathcal{F}, P)$ with which we can begin our work. The sets
in $\mathcal{F}$ are said to be measurable and they alone have probabilities.
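
On a finite $\Omega$ the two requirements can be checked by brute force, since a countable union of sets from a finite family reduces to a finite union. The Python sketch below is an illustration only (the function name and the four-point $\Omega$ are our own); it tests the small families mentioned above.

```python
from itertools import combinations


def is_borel_field(omega, family):
    """Check properties (a) and (b) for a family of subsets of a finite omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in family}
    # (a) closed under complementation
    closed_under_complement = all(omega - a in sets for a in sets)
    # (b) closed under union; pairwise unions suffice for a finite family
    closed_under_union = all(a | b in sets for a, b in combinations(sets, 2))
    return closed_under_complement and closed_under_union


if __name__ == "__main__":
    omega = {1, 2, 3, 4}
    A = {1, 2}
    print(is_borel_field(omega, [set(), omega]))
    print(is_borel_field(omega, [set(), A, omega - A, omega]))
    print(is_borel_field(omega, [set(), A, omega]))
```

The first two families pass; the third fails because the complement of A is missing.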

Let X be a real-valued function defined on $\Omega$. Then X is called a random
variable iff for any real number x, we have

(A.1.1)    $\{\omega \mid X(\omega) \le x\} \in \mathcal{F}.$

Hence $P\{X \le x\}$ is defined, and as a function of x it is the distribution func-
tion F given in (4.6.1). Furthermore if $a < b$, then the set

(A.1.2)    $\{a < X \le b\} = \{X \le b\} - \{X \le a\}$

belongs to $\mathcal{F}$ since $\mathcal{F}$ is closed under difference. Thus its probability is defined
and is in fact given by $F(b) - F(a)$.

When $\Omega$ is countable and we take $\mathcal{F}$ to be the Borel field of all the subsets of
$\Omega$, then of course the condition (A.1.1) is satisfied for any function X. Thus
in this case an arbitrary function on $\Omega$ is a random variable, as defined in §4.2.
In general, the condition in (A.1.1) is imposed mainly because we wish to
define the mathematical expectation by a procedure which requires such a 
condition. Specifically, if X is a bounded random variable, then it has an 
expectation given by the formula below: 


(A.1.3)    $E(X) = \lim_{\delta \downarrow 0} \sum_{n=-\infty}^{\infty} n\delta\, P\{n\delta < X \le (n + 1)\delta\},$


where the probabilities in the sum are well defined by the remark about 
(A.1.2). The existence of the limit in (A.1.3), and the consequent properties 
of the expectation which extend those discussed in Chapters 5 and 6, are part 
of a general theory known as that of Lebesgue integration [Henri Lebesgue
(1875-1941), co-founder with Borel of the modern school of measure and
integration]. We must refer the reader to standard treatments of the subject
except to exhibit $E(X)$ as an integral as follows:

$E(X) = \int_{\Omega} X(\omega)\, P(d\omega);$

cf. the discrete analogue (4.3.11) in a countable $\Omega$.
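
The sum in (A.1.3) needs nothing but the distribution function, since each probability in it equals $F((n + 1)\delta) - F(n\delta)$. The Python sketch below, a rough illustration with names and an example of our own, evaluates the sum for a random variable uniformly distributed on [0, 1] and shows it creeping up to 1/2 as $\delta$ shrinks.

```python
import math


def lebesgue_sum(cdf, delta, lower, upper):
    """Sum of n*delta * P{n*delta < X <= (n+1)*delta} for X with values in [lower, upper]."""
    n_min = math.floor(lower / delta) - 1
    n_max = math.ceil(upper / delta)
    total = 0.0
    for n in range(n_min, n_max + 1):
        # By (A.1.2) the probability of the slice is a difference of two values of F.
        prob = cdf((n + 1) * delta) - cdf(n * delta)
        total += n * delta * prob
    return total


def uniform01_cdf(x):
    """Distribution function of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


if __name__ == "__main__":
    for delta in (0.1, 0.01, 0.001):
        print(delta, lebesgue_sum(uniform01_cdf, delta, 0.0, 1.0))
```

Each slice is weighted by its lower end $n\delta$, so the sum approaches E(X) from below in this example.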


Chapter 5 


Conditioning and Independence 


5.1. Examples of conditioning 


We have seen that the probability of a set A is its weighted proportion
relative to the sample space $\Omega$. When $\Omega$ is finite and all sample points have
the same weight (therefore equally likely), then

$P(A) = \dfrac{|A|}{|\Omega|}$

as in Example 4 of §2.2. When $\Omega$ is countable and each point $\omega$ has the weight
$P(\omega) = P(\{\omega\})$ attached to it, then

(5.1.1)    $P(A) = \dfrac{\sum_{\omega \in A} P(\omega)}{\sum_{\omega \in \Omega} P(\omega)}$

from (2.4.3), since the denominator above is equal to 1. In many questions
we are interested in the proportional weight of one set A relative to another
set S. More accurately stated, this means the proportional weight of the part
of A in S, namely the intersection $A \cap S$, or AS, relative to S. The formula
analogous to (5.1.1) is then

(5.1.2)    $\dfrac{\sum_{\omega \in AS} P(\omega)}{\sum_{\omega \in S} P(\omega)}.$


Thus we are switching our attention from Q to S as a new universe, and con- 
sidering a new proportion or probability with respect to it. We introduce the 
notation 


(5.1.3)    $P(A \mid S) = \dfrac{P(AS)}{P(S)}$

and call it the conditional probability of A relative to S. Other phrases such as
“given S,” “knowing S,” or “under the hypothesis [of] S” may also be used to
describe this relativity. Of course if $P(S) = 0$ then the ratio in (5.1.3) becomes
the “indeterminate” 0/0 which has neither meaning nor utility; so whenever
we write a conditional probability such as $P(A \mid S)$ we shall impose the proviso
that $P(S) > 0$ even if this is not explicitly mentioned. Observe that the ratio
in (5.1.3) reduces to that in (5.1.2) when $\Omega$ is countable, but is meaningful in
the general context where the probabilities of A and S are defined. The follow-
ing preliminary examples will illustrate the various possible motivations and
interpretations of the new concept.
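
For a finite weighted sample space the ratio (5.1.2) is a two-line computation. The Python sketch below is a minimal illustration: the four sample points, their weights and the function names are all hypothetical choices of ours. It also enforces the proviso P(S) > 0.

```python
from fractions import Fraction


def prob(weights, event):
    """P(event): the total weight of the sample points in the event."""
    return sum((p for w, p in weights.items() if w in event), Fraction(0))


def conditional_prob(weights, a, s):
    """P(A | S) = P(AS) / P(S), defined only when P(S) > 0."""
    p_s = prob(weights, s)
    if p_s == 0:
        raise ValueError("P(S) must be positive")
    return prob(weights, a & s) / p_s


# A hypothetical sample space with four points of unequal weight.
weights = {
    "a": Fraction(1, 2),
    "b": Fraction(1, 4),
    "c": Fraction(1, 8),
    "d": Fraction(1, 8),
}

if __name__ == "__main__":
    A = {"a", "c"}
    S = {"b", "c", "d"}
    print(prob(weights, A), conditional_prob(weights, A, S))
```

Here P(A) = 5/8 but P(A | S) = 1/4: once S is taken as the new universe, only the part of A inside S counts.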


Example 1. All the students on a certain college campus are polled as to 
their reaction to a certain presidential candidate. Let D denote those who 
favor him. Now the student population $\Omega$ may be cross-classified in various
ways, for instance according to sex, age, race, etc. Let

A = female, B = black, C = of voting age.

Then $\Omega$ is partitioned as in (1.3.5) into 8 subdivisions $ABC$, $ABC^c$, ...,
$A^cB^cC^c$. Their respective numbers will be known if a complete poll is made,
and the set D will in general cut across the various divisions. For instance

$P(D \mid A^cBC) = \dfrac{P(A^cBCD)}{P(A^cBC)}$

denotes the proportion of male black students of voting age who favor the
candidate;

$P(D^c \mid A^cC) = \dfrac{P(A^cCD^c)}{P(A^cC)}$
denotes the proportion of male students of voting age who do not favor the 
candidate, etc. 


Example 2. A perfect die is thrown twice. Given [knowing] that the total 
obtained is 7, what is the probability that the first point obtained is k,
$1 \le k \le 6$?

Look at the list in Example 3 of §4.1. The outcomes with total equal to 7
are those on the “second diagonal” and their number is six. Among these
there is one case in which the first throw is k. Hence the conditional proba-
bility is equal to 1/6. In symbols, let $X_1$ and $X_2$ denote respectively the point
obtained in the first and second throw. Then we have as a case of (5.1.3),

$P\{X_1 = k \mid X_1 + X_2 = 7\} = \dfrac{P\{X_1 = k;\ X_1 + X_2 = 7\}}{P\{X_1 + X_2 = 7\}} = \dfrac{1}{6}.$

The fact that this turns out to be the same as the unconditional probability
$P\{X_1 = k\}$ is an accident due to the lucky choice of the number 7. It is the
only value of the total which allows all six possibilities for each throw. As
other examples, we have

$P\{X_1 = k \mid X_1 + X_2 = 6\} = \dfrac{1}{5}, \quad 1 \le k \le 5,$

$P\{X_1 = k \mid X_1 + X_2 = 9\} = \dfrac{1}{4}, \quad 3 \le k \le 6.$

Here it should be obvious that the conditional probabilities will be the same
if $X_1$ and $X_2$ are interchanged. Why?

Next, we ask the apparently simpler question: given $X_1 = 4$, what is the
probability that $X_2 = k$? You may jump to the answer that this must be 1/6
since the second throw is not affected by the first, so the conditional proba-
bility $P\{X_2 = k \mid X_1 = 4\}$ must be the same as the unconditional one
$P\{X_2 = k\}$. This is certainly correct provided we use the independence be-
tween the two trials (see §2.4). For the present we can use (5.1.3) to get

(5.1.4)    $P\{X_2 = k \mid X_1 = 4\} = \dfrac{P\{X_1 = 4;\ X_2 = k\}}{P\{X_1 = 4\}} = \dfrac{1/36}{1/6} = \dfrac{1}{6}.$

Finally, we have

(5.1.5)    $P\{X_1 + X_2 = 7 \mid X_1 = 4\} = \dfrac{P\{X_1 = 4;\ X_1 + X_2 = 7\}}{P\{X_1 = 4\}}.$

Without looking at the list of outcomes, we observe that the event $\{X_1 = 4;$
$X_1 + X_2 = 7\}$ is exactly the same as $\{X_1 = 4;\ X_2 = 7 - 4 = 3\}$; so in effect
(5.1.5) is a case of (5.1.4). This argument may seem awfully devious at this 
juncture, but is an essential feature of a random walk (see Chapter 8). 
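
All the conditional probabilities in this example can be obtained by counting the 36 outcomes. The Python sketch below is an illustrative check; the helper names are ours, and events are represented as functions of the outcome $(x_1, x_2)$.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = list(product(range(1, 7), repeat=2))  # (first throw, second throw)


def prob(event):
    """Probability of an event under 36 equally likely outcomes."""
    favourable = sum(1 for w in OUTCOMES if event(w))
    return Fraction(favourable, len(OUTCOMES))


def cond_prob(event, given):
    """P(event | given) computed from (5.1.3)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)


if __name__ == "__main__":
    for total in (6, 7, 9):
        row = [cond_prob(lambda w, k=k: w[0] == k, lambda w: sum(w) == total)
               for k in range(1, 7)]
        print("total", total, [str(p) for p in row])
    print(cond_prob(lambda w: sum(w) == 7, lambda w: w[0] == 4))
```

For the total 7 every k gets 1/6; for the totals 6 and 9 the possible values of k get 1/5 and 1/4 respectively and the impossible ones get 0.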


Example 3. Consider the waiting time X in Example 8 of §4.4, for a biased 
coin. Knowing that it has fallen tails three times, what is the probability that 
it will fall heads within the next two trials? 

This is the conditional probability

(5.1.6)    $P(X \le 5 \mid X \ge 4) = \dfrac{P(4 \le X \le 5)}{P(X \ge 4)}.$

We know that

(5.1.7)    $P(X = n) = q^{n-1} p, \quad n = 1, 2, \ldots;$

from which we can calculate

(5.1.8)    $P(X \ge 4) = \sum_{n=4}^{\infty} q^{n-1} p = \dfrac{q^3 p}{1 - q} = q^3$

(how do we sum the series?). Again from (5.1.7),

$P(4 \le X \le 5) = q^3 p + q^4 p.$

Thus the answer to (5.1.6) is $p + qp$. Now we have also from (5.1.7) the
probability that the coin falls heads (at least once in two trials):

$P(1 \le X \le 2) = p + qp.$


Comparing these two results, we conclude that the three previous failures 
do not affect the future waiting time. This may seem obvious to you a priori, 
but it is a consequence of independence of the successive trials. By the way, 
many veteran gamblers at the roulette game believe that “if reds have ap- 
peared so many times in a row, then it is smart to bet on the black on the 
next spin because in the long run red and black should balance out.’’ On the 
other hand, you might argue (with Lord Keynes† on your side) that if red
has appeared say ten times in a row, in the absence of other evidence, it 
would be a natural presumption that the roulette wheel or the croupier is 
biased toward the red, namely p > 1/2 in the above, and therefore the smart 
money should be on it. See Example 8 in §5.2 below for a similar discussion. 


† John Maynard Keynes [1883-1946], English economist and writer.


Example 4. We shall bring out an analogy between the geometrical distribution
given in (5.1.2) [see also (4.4.8)] and the exponential distribution in
(4.5.11). If X has the former distribution, then for any non-negative integer n
we have

$$P(X > n) = q^n. \tag{5.1.9}$$

This can be shown by summing a geometrical series as in (5.1.8), but is obvious
if we remember that “X > n” means that the first n tosses all show tails. It
now follows from (5.1.9) that for any non-negative integers m and n, we
have

$$P(X > n + m \mid X > m) = \frac{P(X > n + m)}{P(X > m)} = \frac{q^{n+m}}{q^m} = q^n = P(X > n). \tag{5.1.10}$$

Now let T denote the waiting time in Example 12 of §4.5; then we have
analogously for any non-negative real values of s and t:

$$P(T > s + t \mid T > s) = \frac{P(T > s + t)}{P(T > s)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda s}} = e^{-\lambda t} = P(T > t). \tag{5.1.11}$$

This may be announced as follows: if we have already spent some time in
waiting, the distribution of further waiting time is the same as that of the
initial waiting time as if we have waited in vain! A suggestive way of saying this
is that the random variable T has no memory. This turns out to be a
fundamental property of the exponential distribution which is not shared by any
other, and is basic for the theory of Markov processes. Note that although the
geometrical distribution is a discrete analogue as shown in (5.1.10), strictly
speaking it does not have the “memoryless” property because (5.1.10) may
become false when n and m are not integers: take e.g. n = m = 1/2.
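
The two "no memory" identities, and the failure of the geometric one at non-integer arguments, can be checked numerically. This is a minimal sketch under our own naming (`geom_tail`, `exp_tail`, `cond_tail` are hypothetical helpers); the rate parameter `lam` stands for the constant of the exponential distribution in (4.5.11), and the values p = 0.3 and lam = 2.0 are arbitrary.

```python
import math

def geom_tail(x, p):
    """P(X > x) when P(X = n) = q**(n-1) * p for n = 1, 2, ...

    X is integer-valued, so for real x >= 0 the event {X > x} is the same
    as {X > floor(x)} and the tail equals q**floor(x).
    """
    q = 1.0 - p
    return q ** math.floor(x)

def exp_tail(t, lam):
    """P(T > t) = exp(-lam * t) for the exponential waiting time T."""
    return math.exp(-lam * t)

def cond_tail(tail, elapsed, extra, param):
    """P(W > elapsed + extra | W > elapsed), computed from a tail function."""
    return tail(elapsed + extra, param) / tail(elapsed, param)

p, lam = 0.3, 2.0
# Integer arguments: the geometric distribution behaves as in (5.1.10).
print(cond_tail(geom_tail, 3, 2, p), geom_tail(2, p))
# Real arguments: the exponential distribution behaves as in (5.1.11).
print(cond_tail(exp_tail, 1.7, 0.4, lam), exp_tail(0.4, lam))
# n = m = 1/2: the left side is q, the right side is 1.
print(cond_tail(geom_tail, 0.5, 0.5, p), geom_tail(0.5, p))
```

The first line also recovers Example 3, since $P(X \le 5 \mid X \ge 4) = 1 - P(X > 5 \mid X > 3)$.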


Example 5. Consider all families with two children and assume that boys 
and girls are equally likely. Thus the sample space may be denoted sche- 
matically by 4 points: 


$$\Omega = \{(bb), (bg), (gb), (gg)\}$$


where b = boy, g = girl; the order in each pair is the order of birth; and the 
4 points have probability 1/4 each. We may of course use instead a space of 
4N points, where N is a large number, in which the four possibilities have 
equal numbers. This will be a more realistic population model but the arith- 
metic below will be the same. 

If a family is chosen at random from Ω, and found to have a boy in it,
what is the probability that it has another boy, namely that it is of the type
(b, b)? A quickie answer might be 1/2 if you jumped to the conclusion from
the equal likelihood of the sexes. This is a mistake induced by a misplaced 
“relative clause’ for the conditional probability in question. Here is the de- 
tailed explanation. 

Let us put 


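$$A = \{\omega \mid \text{there is a boy in } \omega\},$$

$$B = \{\omega \mid \text{there are two boys in } \omega\}.$$

Then $B \subset A$ and so $AB = B$, thus

$$P(B \mid A) = \frac{P(B)}{P(A)} = \frac{1/4}{3/4} = \frac{1}{3}.$$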

This is the correct answer to the question. But now let us ask a similar 
sounding but really different question. If a child is chosen at random from 
these families and is found to be a boy, what is the probability that the other 
child in his family is also a boy? This time the appropriate representation 
of the sample space should be 


$$\tilde{\Omega} = \{g_g, g_b, b_g, b_b\},$$

where the sample points are not families but the children of these families,
and $g_g$ = a girl who has a sister, $g_b$ = a girl who has a brother, etc. [Observe
that here we have ordered the g’s before the b’s to allay possible criticism
from Woman’s Liberation Movement.] Now we have

$$\tilde{A} = \{\tilde{\omega} \mid \tilde{\omega} \text{ is a boy}\},$$

$$\tilde{B} = \{\tilde{\omega} \mid \tilde{\omega} \text{ has a brother}\},$$

so that

$$\tilde{A}\tilde{B} = \{\tilde{\omega} \mid \tilde{\omega} = b_b\}.$$

Therefore

$$P(\tilde{B} \mid \tilde{A}) = \frac{P(\tilde{A}\tilde{B})}{P(\tilde{A})} = \frac{1/4}{1/2} = \frac{1}{2}.$$


This is a wonderful and by no means artificial illustration of the impor- 
tance of understanding ‘‘what we are sampling” in statistics. 
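
The difference between sampling a family and sampling a child can be made concrete by enumeration. The sketch below is illustrative only; the function names are ours. It lists the eight children individually (two per family type), which amounts to the four-point space of child types with probability 1/4 each.

```python
from fractions import Fraction
from itertools import product

# Four equally likely family types, in order of birth: bb, bg, gb, gg.
FAMILIES = ["".join(pair) for pair in product("bg", repeat=2)]

def prob_family_two_boys_given_a_boy():
    """Sample a family; condition on 'the family has a boy in it'."""
    with_boy = [f for f in FAMILIES if "b" in f]
    two_boys = [f for f in with_boy if f == "bb"]
    return Fraction(len(two_boys), len(with_boy))

def prob_child_has_brother_given_boy():
    """Sample a child; condition on 'the child is a boy'."""
    # A child is a (family, birth position) pair; all 8 are equally likely.
    children = [(f, i) for f in FAMILIES for i in (0, 1)]
    boys = [(f, i) for (f, i) in children if f[i] == "b"]
    with_brother = [(f, i) for (f, i) in boys if f[1 - i] == "b"]
    return Fraction(len(with_brother), len(boys))

print(prob_family_two_boys_given_a_boy())  # 1/3
print(prob_child_has_brother_given_boy())  # 1/2
```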


5.2. Basic formulas 


Generally speaking, most problems of probability have to do with several 
events or random variables and it is their mutual relation or joint action that 
must be investigated. In a sense all probabilities are conditional because 
nothing happens in a vacuum. We omit the stipulation of conditions which 
are implicit or taken for granted, or if we feel that they are irrelevant to the 
situation in hand. For instance, when a coin is tossed we usually ignore the 
possibility that it will stand on its edge, and do not even specify whether it is 
Canadian or American. The probability that a certain candidate will win an 
election is certainly conditioned on his surviving the campaign—an assump- 
tion which has turned out to be premature in recent American history. 

Let us begin by a few simple but fundamental propositions involving con- 
ditional probabilities: 


Proposition 1. For arbitrary events $A_1, A_2, \ldots, A_n$, we have

$$P(A_1 A_2 \cdots A_n) = P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 A_2) \cdots P(A_n \mid A_1 A_2 \cdots A_{n-1}) \tag{5.2.1}$$

provided $P(A_1 A_2 \cdots A_{n-1}) > 0$.


Proof: Under the proviso all conditional probabilities in (5.2.1) are well
defined since

$$P(A_1) \ge P(A_1 A_2) \ge \cdots \ge P(A_1 A_2 \cdots A_{n-1}) > 0.$$

Now the right side of (5.2.1) is explicitly:

$$\frac{P(A_1)}{P(\Omega)} \cdot \frac{P(A_1 A_2)}{P(A_1)} \cdot \frac{P(A_1 A_2 A_3)}{P(A_1 A_2)} \cdots \frac{P(A_1 A_2 \cdots A_n)}{P(A_1 A_2 \cdots A_{n-1})}$$

which reduces to the left side by successive cancellation. Q.E.D.
By contrast with the additivity formula (2.2.3) for a disjoint union, the
formula (5.2.1) may be called the general multiplicative formula for the
probability of an intersection. But observe how the conditioning events are also
“multiplied” step by step. A much simpler formula has been given in §2.4
for independent events. As an important application of (5.2.1), suppose the
random variables $X_1, X_2, \ldots, X_n, \ldots$ are all countably-valued; this is surely
the case when Ω is countable. Now for arbitrary possible values
$x_1, x_2, \ldots, x_n, \ldots$, we put

$$A_k = \{X_k = x_k\}, \quad k = 1, 2, \ldots,$$

and obtain

$$\begin{aligned}
P\{X_1 = x_1; X_2 = x_2; \ldots; X_n = x_n\}
&= P\{X_1 = x_1\}\, P\{X_2 = x_2 \mid X_1 = x_1\}\, P\{X_3 = x_3 \mid X_1 = x_1, X_2 = x_2\} \\
&\quad \cdots P\{X_n = x_n \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}\}.
\end{aligned} \tag{5.2.2}$$

The first term above is called the joint probability of $X_1, X_2, \ldots, X_n$; so the
formula expresses this by successive conditional probabilities. Special cases
of this will be discussed later.


Proposition 2. Suppose that

$$\Omega = \sum_n A_n$$

is a partition of the sample space into disjoint sets. Then for any set B we have

$$P(B) = \sum_n P(A_n) P(B \mid A_n). \tag{5.2.3}$$

Proof: First we write

$$B = \Omega B = \Big(\sum_n A_n\Big) B = \sum_n A_n B$$

by simple set theory, in particular (1.3.6); then we deduce

$$P(B) = P\Big(\sum_n A_n B\Big) = \sum_n P(A_n B)$$

by countable additivity of P. Finally we substitute

$$P(A_n B) = P(A_n) P(B \mid A_n)$$

from the definition (5.1.3). This establishes (5.2.3); note that if $P(A_n) = 0$ for
some n, the corresponding term in the sum there may be taken to be 0 even
though $P(B \mid A_n)$ is undefined. Q.E.D.
From now on we shall adopt the convention that $x \cdot 0 = 0$ if x is undefined,
in order to avoid repetition of such remarks as in the preceding sentence.

The formula (5.2.3) will be referred to as that of total probability. Here
is a useful interpretation. Suppose that the event B may occur under a number
of mutually exclusive circumstances (or “causes”). Then the formula shows
how its “total probability” is compounded from the probabilities of the
various circumstances, and the corresponding conditional probabilities figured
under the respective hypotheses.

Suppose X and Y are two integer-valued random variables and k is an
integer. If we apply (5.2.3) to the sets

$$A_n = \{X = n\}, \quad B = \{Y = k\},$$

we obtain

$$P(Y = k) = \sum_n P(X = n) P(Y = k \mid X = n) \tag{5.2.4}$$

where the sum is over all integers n, and if $P(X = n) = 0$ the corresponding
term may be taken to be 0. It is easy to generalize the formula when X takes
values in any countable range, and when “Y = k” is replaced by e.g.,
“$a \le Y \le b$” for a more general random variable, not necessarily taking
integer values.


Proposition 3. Under the assumption and notation of Proposition 2, we have
also

$$P(A_n \mid B) = \frac{P(A_n) P(B \mid A_n)}{\sum_n P(A_n) P(B \mid A_n)} \tag{5.2.5}$$

provided $P(B) > 0$.


Proof: The denominator above is equal to P(B) by Proposition 2, so the
equation may be multiplied out to read

$$P(B) P(A_n \mid B) = P(A_n) P(B \mid A_n).$$

This is true since both sides are equal to $P(A_n B)$. Q.E.D.


This simple proposition with an easy proof is very famous under the name
of Bayes’ Theorem, published in 1763. It is supposed to yield an “inverse
probability,” or probability of the “cause” $A_n$ on the basis of the observed
“effect” B. Whereas $P(A_n)$ is the a priori, $P(A_n \mid B)$ is the a posteriori
probability of the cause $A_n$. Numerous applications were made in all areas of
natural phenomena and human behavior. For instance, if B is a “body” and
the $A_n$’s are the several suspects of the murder, then the theorem will help
the jury or court to decide the whodunit. [Jurisprudence was in fact a major
field of early speculations on probability.] If B is an earthquake and the $A_n$’s
are the different physical theories to explain it, then the theorem will help
the scientists to choose between them. Laplace [1749-1827; one of the great
mathematicians of all time who wrote a monumental treatise on probability
around 1815] used the theorem to estimate the probability that “the sun will
also rise tomorrow” (see Example 9 below). In modern times Bayes lent his
name to a school of statistics. For our discussion here let us merely comment
that Bayes has certainly hit upon a remarkable turn-around for conditional
probabilities, but the practical utility of his formula is limited by our usual
lack of knowledge on the various a priori probabilities.

The following simple examples are given to illustrate the three proposi- 
tions above. Others will appear in the course of our work. 


Example 6. We have actually seen several examples of Proposition 1 before
in Chapter 3. Let us re-examine them using the new notion.

What is the probability of throwing six perfect dice and getting six different
faces? [See Example 3 of §3.1.] Number the dice from 1 to 6, and put:

$A_1$ = any face for Die 1,
$A_2$ = Die 2 shows a different face from Die 1,
$A_3$ = Die 3 shows a different face from Die 1 and Die 2,

etc. Then we have, assuming that the dice act independently:

$$P(A_1) = 1, \quad P(A_2 \mid A_1) = \frac{5}{6}, \quad P(A_3 \mid A_1 A_2) = \frac{4}{6}, \quad \ldots, \quad P(A_6 \mid A_1 A_2 \cdots A_5) = \frac{1}{6}.$$

Hence an application of Proposition 1 gives

$$P(A_1 A_2 \cdots A_6) = \frac{6}{6} \cdot \frac{5}{6} \cdot \frac{4}{6} \cdot \frac{3}{6} \cdot \frac{2}{6} \cdot \frac{1}{6} = \frac{6!}{6^6}.$$

The birthday problem [Problem 5 of §3.4] is now seen to be practically
the same problem, where the number 6 above is replaced by 365. The
sequential method mentioned there is just another case of Proposition 1.
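
The sequential product of Proposition 1 is easy to evaluate mechanically. The sketch below is ours (the helper `prob_all_distinct` is a hypothetical name); for the birthday illustration it assumes 365 equally likely birthdays and a group of 23 people, a group size chosen by us and not taken from the text.

```python
from fractions import Fraction
from math import factorial

def prob_all_distinct(n_draws, n_values):
    """Product P(A_1) P(A_2 | A_1) ... for the event 'all draws are distinct'.

    Each factor is the chance that the next draw avoids the distinct values
    already seen: (n_values - already_seen) / n_values.
    """
    prob = Fraction(1)
    for already_seen in range(n_draws):
        prob *= Fraction(n_values - already_seen, n_values)
    return prob

six_dice = prob_all_distinct(6, 6)
print(six_dice == Fraction(factorial(6), 6**6))   # True
# Chance that at least two of 23 people share a birthday.
print(float(1 - prob_all_distinct(23, 365)))
```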


Example 7. The family dog is missing after the picnic. Three hypotheses are
suggested:

(A) it has gone home;
(B) it is still worrying that big bone in the picnic area;
(C) it has wandered off into the woods.

The a priori probabilities, which are assessed from the habits of the dog, are
estimated respectively to be 1/4, 1/2, 1/4. A child each is sent back to the picnic
ground and the edge of the woods to look for the dog. If it is in the former
area, it is a cinch (90%) that it will be found; if it is in the latter, the chance
is only a toss-up (50%). What is the probability that the dog will be found
in the park?

Let A, B, C be the hypotheses above, and let D = “dog will be found in
the park.” Then we have the following data:

$$P(A) = \frac{1}{4}, \quad P(B) = \frac{1}{2}, \quad P(C) = \frac{1}{4};$$

$$P(D \mid A) = 0, \quad P(D \mid B) = \frac{90}{100}, \quad P(D \mid C) = \frac{50}{100}.$$

Hence by (5.2.3),

$$P(D) = P(A)P(D \mid A) + P(B)P(D \mid B) + P(C)P(D \mid C) = \frac{1}{4} \cdot 0 + \frac{1}{2} \cdot \frac{90}{100} + \frac{1}{4} \cdot \frac{50}{100} = \frac{115}{200}.$$

What is the probability that the dog will be found at home? Call this D′, and
assume that $P(D' \mid A) = 1$, namely that if it is home it will be there to greet
the family. Clearly $P(D' \mid B) = P(D' \mid C) = 0$ and so

$$P(D') = P(A)P(D' \mid A) + P(B)P(D' \mid B) + P(C)P(D' \mid C) = \frac{1}{4} \cdot 1 + \frac{1}{2} \cdot 0 + \frac{1}{4} \cdot 0 = \frac{1}{4}.$$

What is the probability that the dog is “lost”? It is

$$1 - P(D) - P(D') = \frac{35}{200}.$$
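
The arithmetic of this example is a direct use of the total probability formula (5.2.3) and can be sketched as follows. The a priori values 1/4, 1/2, 1/4 are those implied by the computation of P(D) in the example; the dictionary layout and the helper `total_probability` are our own illustration.

```python
from fractions import Fraction

# Hypotheses: A = gone home, B = still in the picnic area, C = in the woods.
prior = {"A": Fraction(1, 4), "B": Fraction(1, 2), "C": Fraction(1, 4)}

# Conditional probability of each outcome under each hypothesis.
found_in_park = {"A": Fraction(0), "B": Fraction(90, 100), "C": Fraction(50, 100)}
found_at_home = {"A": Fraction(1), "B": Fraction(0), "C": Fraction(0)}

def total_probability(prior, likelihood):
    """P(E) = sum over hypotheses H of P(H) * P(E | H)."""
    return sum(prior[h] * likelihood[h] for h in prior)

p_park = total_probability(prior, found_in_park)
p_home = total_probability(prior, found_at_home)
p_lost = 1 - p_park - p_home
print(p_park, p_home, p_lost)  # 23/40 1/4 7/40
```

The printed fractions are 115/200, 1/4 and 35/200 in lowest terms.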

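Example 8. Urn one contains 2 black and 3 red balls; urn two contains 3
black and 2 red balls. We toss an unbiased coin to decide on the urn to draw
from but we do not know which is which. Suppose the first ball drawn is
black and it is put back, what is the probability that the second ball drawn
from the same urn is also black?

Call the two urns $U_1$ and $U_2$; the a priori probability that either one is
chosen by the coin-tossing is 1/2:

$$P(U_1) = \frac{1}{2}, \quad P(U_2) = \frac{1}{2}.$$

Denote the event that the first ball is black by $B_1$, that the second ball is black
by $B_2$. We have by (5.2.5)

$$P(U_1 \mid B_1) = \frac{\frac{1}{2} \cdot \frac{2}{5}}{\frac{1}{2} \cdot \frac{2}{5} + \frac{1}{2} \cdot \frac{3}{5}} = \frac{2}{5}, \quad P(U_2 \mid B_1) = \frac{\frac{1}{2} \cdot \frac{3}{5}}{\frac{1}{2} \cdot \frac{2}{5} + \frac{1}{2} \cdot \frac{3}{5}} = \frac{3}{5}.$$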
Note that the two probabilities must add up to one (why?) so we need only
compute one of them. Note also that the two a posteriori probabilities are
directly proportional to the probabilities $P(B_1 \mid U_1)$ and $P(B_1 \mid U_2)$. That is,
the black ball drawn is more likely to have come from the urn which is more
likely to yield a black ball, and in the proper ratio. Now use (5.2.3) to compute
the probability that the second ball is also black. Here $A_1$ = “$B_1$ is from $U_1$,”
$A_2$ = “$B_1$ is from $U_2$” are the two alternative hypotheses. Since the second
drawing is conditioned on $B_1$, the probabilities of the hypotheses are really
conditional ones:

$$P(A_1) = P(U_1 \mid B_1) = \frac{2}{5}, \quad P(A_2) = P(U_2 \mid B_1) = \frac{3}{5}.$$

On the other hand, it is obvious that

$$P(B_2 \mid A_1) = \frac{2}{5}, \quad P(B_2 \mid A_2) = \frac{3}{5}.$$

Hence we obtain the conditional probability

$$P(B_2 \mid B_1) = \frac{2}{5} \cdot \frac{2}{5} + \frac{3}{5} \cdot \frac{3}{5} = \frac{13}{25}.$$

Compare this with

$$P(B_2) = P(U_1)P(B_2 \mid U_1) + P(U_2)P(B_2 \mid U_2) = \frac{1}{2} \cdot \frac{2}{5} + \frac{1}{2} \cdot \frac{3}{5} = \frac{1}{2}.$$


We see that the knowledge of the first ball drawn being black has strengthened
the probability of drawing a second black ball, because it has increased the
likelihood that we have picked the urn with more black balls. To proceed one
more step, given that the first two balls drawn are both black and put back,
what is the probability of drawing a third black ball from the same urn? We
have in notation similar to the above:

$$P(U_1 \mid B_1 B_2) = \frac{\frac{1}{2}\left(\frac{2}{5}\right)^2}{\frac{1}{2}\left(\frac{2}{5}\right)^2 + \frac{1}{2}\left(\frac{3}{5}\right)^2} = \frac{4}{13}, \quad P(U_2 \mid B_1 B_2) = \frac{9}{13};$$

$$P(B_3 \mid B_1 B_2) = \frac{4}{13} \cdot \frac{2}{5} + \frac{9}{13} \cdot \frac{3}{5} = \frac{35}{65}.$$

This is greater than 13/25, so a further strengthening has occurred. Now it is easy
to see that we can extend the result to any number of drawings. Thus,

$$P(U_1 \mid B_1 B_2 \cdots B_n) = \frac{\frac{1}{2}\left(\frac{2}{5}\right)^n}{\frac{1}{2}\left(\frac{2}{5}\right)^n + \frac{1}{2}\left(\frac{3}{5}\right)^n} = \frac{1}{1 + \left(\frac{3}{2}\right)^n},$$

where we have divided the denominator by the numerator in the middle term.
It follows that as n becomes larger and larger, the a posteriori probability of
$U_1$ becomes smaller and smaller, in fact it decreases to zero and consequently
the a posteriori probability of $U_2$ increases to one in the limit. Thus we have

$$\lim_{n \to \infty} P(B_{n+1} \mid B_1 B_2 \cdots B_n) = \frac{3}{5} = P(B_1 \mid U_2).$$
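
The successive strengthening can be tabulated by combining Bayes' formula (5.2.5) with total probability (5.2.3). The following is a small sketch with names of our own choosing (`posterior_after_blacks`, `next_black_given_blacks`); it assumes, as in the example, that each ball drawn is put back.

```python
from fractions import Fraction

# Chance of a black ball from each urn: U1 holds 2 of 5, U2 holds 3 of 5.
P_BLACK = {"U1": Fraction(2, 5), "U2": Fraction(3, 5)}
PRIOR = {"U1": Fraction(1, 2), "U2": Fraction(1, 2)}

def posterior_after_blacks(n):
    """P(U | B_1 ... B_n) for each urn U, by Bayes' formula."""
    weights = {u: PRIOR[u] * P_BLACK[u] ** n for u in PRIOR}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}

def next_black_given_blacks(n):
    """P(B_{n+1} | B_1 ... B_n): total probability under the posterior."""
    post = posterior_after_blacks(n)
    return sum(post[u] * P_BLACK[u] for u in post)

for n in range(6):
    print(n, posterior_after_blacks(n)["U1"], float(next_black_given_blacks(n)))
```

The last column climbs from 1/2 toward 3/5 while the posterior weight of the first urn shrinks like $1/(1 + (3/2)^n)$.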


This simple example has important implications on the empirical viewpoint
of probability. Replace the two urns above by a coin which may be
biased (as all real coins are). Assume that the probability p of heads is either
2/5 or 3/5 but we do not know which is the true value. The two possibilities are
then two alternative hypotheses between which we must decide. If they both
have the a priori probability 1/2, then we are in the situation of the two urns.
The outcome of each toss will affect our empirical estimate of the value of p.
Suppose for some reason we believe that p = 2/5. Then if the coin falls heads
10 times in a row, can we still maintain that p = 2/5 and give probability
$(2/5)^{10}$ to this rare event? Or shall we concede that really p = 3/5 so that the
same event will have probability $(3/5)^{10}$? This is very small but still
$(3/2)^{10}$ times larger than the other. In certain problems of probability theory it
is customary to consider the value of p as fixed and base the rest of our
calculations on it. So the query is what reason do we have to maintain such a
fixed stance in the face of damaging evidence given by observed outcomes? Keynes
made a point of this criticism on the foundations of probability. From the
axiomatic point of view, as followed in this book, a simple answer is this: our
formulas are correct for each arbitrary value of p, but axioms of course do not
tell us what this value is, nor even whether it makes sense to assign any value
at all. The latter may be the case when one talks about the probability of the
existence of some “big living creatures somewhere in outer space.” [It used to
be the moon!] In other words, mathematics proper being a deductive science, the
problem of evaluating, estimating or testing the value of p lies outside its
eminent domain. Of course, it is of the utmost importance in practice, and
statistics was invented to cope with this kind of problem. But it need not
concern us too much here. [The author had the authority of Dr. Albert
Einstein on this point, while on a chance stroll on Mercer Street in Princeton,
N.J., sometime in 1946 or 1947. Here is the gist of what he said: in any branch
of science which has applications, there is always a gap, which needs a bridge
between theory and practice. This is so for instance in geometry or mechanics;
and probability is no exception.]

The preceding example has a natural extension when the unknown p may 
take values in a finite or infinite range. Perhaps the most celebrated illus- 
tration is Laplace’s Law of Succession below. 


Example 9. Suppose that the sun has risen n times in succession, what is the
probability that it will rise once more?

It is assumed that the a priori probability for a sunrise on any day is a
constant whose value is unknown to us. Owing to our total ignorance it will
be assumed to take all possible values in [0, 1] with equal likelihood. That is
to say, this probability will be treated as a random variable $\xi$ which is
uniformly distributed over [0, 1]. Thus $\xi$ has the density function f such that
$f(p) = 1$ for $0 \le p \le 1$. This can be written heuristically as

$$P(p \le \xi \le p + dp) = dp, \quad 0 \le p \le 1. \tag{5.2.6}$$

Cf. the discussion in Example 10 of §4.5. Now if the true value of $\xi$ is p, then
under this hypothesis the probability of n successive sunrises is equal to $p^n$,
because they are assumed to be independent events. Let $S^n$ denote the event
that “the sun rises n times in succession,” then we may write heuristically:

$$P(S^n \mid \xi = p) = p^n. \tag{5.2.7}$$

The analogue to (5.2.3) should then be

$$P(S^n) = \sum_{0 \le p \le 1} P(\xi = p) P(S^n \mid \xi = p). \tag{5.2.8}$$

This is of course meaningless as it stands, but if we pass from the sum into
an integral and use (5.2.6), the result is

$$P(S^n) = \int_0^1 P(S^n \mid \xi = p)\, dp = \int_0^1 p^n\, dp = \frac{1}{n+1}. \tag{5.2.9}$$

This continuous version of (5.2.3) is in fact valid, although its derivation
above is not quite so. Accepting the formula and applying it for both n and
n + 1, then taking the ratio, we obtain

$$P(S^{n+1} \mid S^n) = \frac{P(S^n S^{n+1})}{P(S^n)} = \frac{P(S^{n+1})}{P(S^n)} = \frac{1/(n+2)}{1/(n+1)} = \frac{n+1}{n+2}. \tag{5.2.10}$$

This is Laplace’s answer to the sunrise problem.
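
The passage from the heuristic sum to the integral can be imitated numerically by replacing the integral with a fine midpoint sum over [0, 1]. This is only a rough sketch: the function names and the grid size are our own choices, and the agreement with (n + 1)/(n + 2) is approximate.

```python
def sunrise_prob(n, grid=100_000):
    """Midpoint-rule approximation of the integral of p**n over [0, 1]."""
    return sum(((i + 0.5) / grid) ** n for i in range(grid)) / grid

def rule_of_succession(n, grid=100_000):
    """Approximate P(S^{n+1} | S^n) as the ratio P(S^{n+1}) / P(S^n)."""
    return sunrise_prob(n + 1, grid) / sunrise_prob(n, grid)

# Compare the numerical ratio with Laplace's closed form.
for n in (1, 10, 100):
    print(n, rule_of_succession(n), (n + 1) / (n + 2))
```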
In modern parlance, Laplace used an “urn model” to study successive
sunrise as a random process. A sunrise is assimilated to the drawing of a black
ball from an urn of unknown composition. The various possible compositions
are assimilated to so many different urns containing various proportions of
black balls. Finally, the choice of the true value of the proportion is
assimilated to the picking of a random number in [0, 1]. Clearly, these are weighty
assumptions calling forth serious objections at several levels. Is sunrise a
random phenomenon or is it deterministic? Assuming that it can be treated as
random, is the preceding simple urn model adequate to its description?
Assuming that the model is appropriate in principle, why should the a priori
distribution of the true probability be uniformly distributed, and if not how
could we otherwise assess it?

Leaving these great questions aside, let us return for a moment to (5.2.7).
Since $P(\xi = p) = 0$ for every p (see §4.5 for a relevant discussion), the so-called
conditional probability in that formula is not defined by (5.1.3). Yet it makes
good sense from the interpretation given before (5.2.7). In fact, it can be made
completely legitimate by a more advanced theory [Radon-Nikodym
derivative]. Once this is done, the final step (5.2.9) follows without the intervention
of the heuristic (5.2.8). Although a full explanation of these matters lies
beyond the depth of this textbook, it seems proper to mention it here as a
natural extension of the notion of conditional probability. A purely discrete
approach to Laplace’s formula is also possible but the calculations are harder
(see Exercise 35 below).

We end this section by introducing the notion of conditional expectation.
In a countable sample space consider a random variable Y with range $\{y_k\}$
and an event S with P(S) > 0. Suppose that the expectation of Y exists, then
its conditional expectation relative to S is defined to be

$$E(Y \mid S) = \sum_k y_k P(Y = y_k \mid S). \tag{5.2.11}$$

Thus, we simply replace in the formula $E(Y) = \sum_k y_k P(Y = y_k)$ the
probabilities by conditional ones. The series in (5.2.11) converges absolutely because
the last-written series does so. In particular if X is another random variable
with range $\{x_j\}$, then we may take $S = \{X = x_j\}$ to obtain $E(Y \mid X = x_j)$.
On the other hand, we have as in (5.2.4):

$$P(Y = y_k) = \sum_j P(X = x_j) P(Y = y_k \mid X = x_j).$$

Multiplying through by $y_k$, summing over k and rearranging the double series,
we obtain

$$E(Y) = \sum_j P(X = x_j) E(Y \mid X = x_j). \tag{5.2.12}$$

The rearrangement is justified by absolute convergence.
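
Formula (5.2.12) can be checked on a small joint distribution. As an illustration of our own choosing (not an example from the text), take X to be the first of two throws of a fair die and Y the total of both throws; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of (X, Y): X = first throw, Y = total of two throws.
joint = {}
for x1, x2 in product(range(1, 7), repeat=2):
    key = (x1, x1 + x2)
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

def marginal_x(x):
    """P(X = x), summed out of the joint distribution."""
    return sum(p for (xj, _), p in joint.items() if xj == x)

def cond_expectation_y_given_x(x):
    """E(Y | X = x) = sum over k of y_k * P(Y = y_k | X = x)."""
    px = marginal_x(x)
    return sum(y * p / px for (xj, y), p in joint.items() if xj == x)

expectation_y = sum(y * p for (_, y), p in joint.items())
via_conditioning = sum(
    marginal_x(x) * cond_expectation_y_given_x(x) for x in range(1, 7)
)
print(expectation_y, via_conditioning)  # 7 7
```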
The next two sections contain somewhat special material. The reader
may read the beginnings of §§5.3 and 5.4 up to the statements of Theorems 1 
and 3 to see what they are about, but postpone the rest and go to §5.5. 


5.3.* Sequential sampling 


In this section we study an urn model in some detail. It is among the simplest 
schemes that can be handled by elementary methods. Yet it presents rich 
ideas involving conditioning which are important in both theory and practice. 

An urn contains b black balls and r red balls. One ball is drawn at a time
without replacement. Let $X_n = 1$ or 0 according as the nth ball drawn is
black or red. Each sample point ω is then just the sequence
$\{X_1(\omega), X_2(\omega), \ldots, X_{b+r}(\omega)\}$, briefly $\{X_n, 1 \le n \le b + r\}$; see the discussion around
(4.1.3). Such a sequence is called a stochastic process, which is a fancy name
for any family of random variables. [According to the dictionary, “stochastic”
comes from a Greek word meaning “to aim at.”] Here the family is the
finite sequence indexed by n from 1 to b + r. This index n may be regarded as
a time parameter as if one drawing is made per unit time. In this way we can
speak of the gradual evolution of the process as time goes on by observing
the successive $X_n$’s.

You may have noticed that our model is nothing but sampling without 
replacement and with ordering, discussed in §3.2. You are right but our view- 
point has changed and the elaborate description above is meant to indicate 
this. Not only do we want to know e.g., how many black balls are drawn after 
so many drawings, as we would previously, but now we want also to know 
how the sequential drawings affect each other, how the composition of the 
urn changes with time, etc. In other words, we want to investigate the mutual 
dependence of the $X_n$’s, and that’s where conditional probabilities come in.
Let us begin with the easiest kind of question. 


Problem. A ball is drawn from the urn and discarded. Without knowing its 
color, what is the probability that a second ball drawn is black? 

For simplicity let us write the events $\{X_n = 1\}$ as $B_n$ and $\{X_n = 0\}$ as
$R_n = B_n^c$. We have then from Proposition 2 of §5.2,

$$P(B_2) = P(B_1) P(B_2 \mid B_1) + P(B_1^c) P(B_2 \mid B_1^c). \tag{5.3.1}$$

Clearly we have

$$P(B_1) = \frac{b}{b+r}, \quad P(B_1^c) = \frac{r}{b+r}, \tag{5.3.2}$$

whereas

$$P(B_2 \mid B_1) = \frac{b-1}{b+r-1}, \quad P(B_2 \mid B_1^c) = \frac{b}{b+r-1},$$

since there are $b + r - 1$ balls left in the urn after the first drawing, and
among these are $b - 1$ or b black balls according as the first ball drawn is or
is not black. Substituting into (5.3.1) we obtain

$$P(B_2) = \frac{b}{b+r} \cdot \frac{b-1}{b+r-1} + \frac{r}{b+r} \cdot \frac{b}{b+r-1} = \frac{b(b+r-1)}{(b+r)(b+r-1)} = \frac{b}{b+r}.$$

Thus $P(B_2) = P(B_1)$; namely if we take into account both possibilities for the
color of the first ball, then the probabilities for the second ball are the same
as if no ball had been drawn (and left out) before. Is this surprising or not?
Anyone with curiosity would want to know whether this result is an accident
or has a theory behind it. An easy way to test this is to try another step or
two: suppose 2 or 3 balls have been drawn but their colors not noted, what
then is the probability that the next ball will be black? You should carry out
the simple computations by all means. The general result can be stated
succinctly as follows.


Theorem 1. We have for each n,

$$P(B_n) = \frac{b}{b+r}, \quad 1 \le n \le b + r. \tag{5.3.3}$$


It is essential to pause here and remark on the economy of this
mathematical formulation, in contrast to the verbose verbal description above. The
condition that “we do not know” the colors of the n − 1 balls previously
drawn is observed as it were in silence, namely by the absence of conditioning
for the probability $P(B_n)$. What should we have if we know the colors? It
would be something like $P(B_2 \mid B_1)$ or $P(B_3 \mid B_1 B_2)$ or $P(B_4 \mid B_1 B_2 B_3)$. These
are trivial to compute (why?); but we can also have something like $P(B_4 \mid B_2)$
or $P(B_4 \mid B_1 B_3)$ which is slightly less trivial. See Exercise 33.
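
For a small urn the unconditional probabilities of a black ball at each drawing can be checked by brute force before any proof is attempted. The sketch below is our own illustration (the name `draw_probabilities` is hypothetical): it labels the balls so that every ordering is equally likely and counts how often each position holds a black ball.

```python
from fractions import Fraction
from itertools import permutations

def draw_probabilities(b, r):
    """P(black at the n-th drawing) for n = 1, ..., b + r, by enumeration."""
    colors = ["black"] * b + ["red"] * r
    # Permuting ball labels treats same-colored balls as distinct, so all
    # orderings of the b + r labeled balls are equally likely.
    orders = list(permutations(range(b + r)))
    probs = []
    for n in range(b + r):
        black_at_n = sum(1 for order in orders if colors[order[n]] == "black")
        probs.append(Fraction(black_at_n, len(orders)))
    return probs

print(draw_probabilities(3, 2))  # every entry equals 3/5
```

The enumeration grows like (b + r)!, so it is only practical for a handful of balls.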

There are many different ways to prove the beautiful Theorem above; each 
method has some merit and is useful elsewhere. We will give two now, a third 
one in a tremendously more general form (Theorem 4 in §5.4) later. But there 
are others and perhaps you can think of one later. The first method may be 
the toughest for you; if so skip it and go at once to the second.†


† A third method is to make mathematical induction on n.


First Method. This may be called “direct confrontation” or “brute force”
and employs heavy (though standard) weaponry from the combinatory arsenal.
Its merit lies in that it is bound to work provided that we have guessed the
answer in advance, as we can in the present case after a few trials. In other
words, it is a sort of experimental verification. We introduce a new random
variable $Y_n$ = the number of black balls drawn in the first n drawings. This
gives the proportion of black balls when the (n + 1)st drawing is made since
the total number of balls then is equal to $b + r - n$, regardless of the
outcomes of the previous n drawings. Thus we have

$$P(B_{n+1} \mid Y_n = j) = \frac{b - j}{b + r - n}, \quad 0 \le j \le b. \tag{5.3.4}$$

On the other hand, the probability $P(Y_n = j)$ can be computed as in Problem
1 of §3.4, with m = b + r, k = b in (3.4.1):

$$P(Y_n = j) = \frac{\binom{b}{j}\binom{r}{n-j}}{\binom{b+r}{n}}. \tag{5.3.5}$$

We now apply (5.2.4):

$$P(B_{n+1}) = \sum_{j=0}^{b} P(Y_n = j) P(B_{n+1} \mid Y_n = j) = \sum_{j=0}^{b} \frac{\binom{b}{j}\binom{r}{n-j}}{\binom{b+r}{n}} \cdot \frac{b - j}{b + r - n}. \tag{5.3.6}$$

This will surely give the answer, but how in the world are we going to
compute a sum like that? Actually it is not so hard, and there are excellent
mathematicians who make a career out of doing such (and much harder) things.
The beauty of this kind of computation is that it’s got to unravel if our guess
is correct. This faith lends us strength. Just write out the several binomial
coefficients above explicitly, cancelling and inserting factors with a view to
regrouping them into new binomial coefficients:

$$\begin{aligned}
\frac{b!}{j!\,(b-j)!} \cdot \frac{r!}{(n-j)!\,(r-n+j)!} \cdot \frac{n!\,(b+r-n)!}{(b+r)!} \cdot \frac{b-j}{b+r-n}
&= \frac{b!\,r!}{(b+r)!} \cdot \frac{(b+r-n-1)!\,n!}{(r-n+j)!\,(b-j-1)!\,j!\,(n-j)!} \\
&= \frac{1}{\binom{b+r}{b}} \binom{b+r-n-1}{b-j-1} \binom{n}{j}.
\end{aligned}$$

Hence

$$P(B_{n+1}) = \frac{1}{\binom{b+r}{b}} \sum_{j=0}^{b-1} \binom{n}{j} \binom{b+r-1-n}{b-1-j}, \tag{5.3.7}$$

where the term corresponding to j = b has been omitted since it yields zero
in (5.3.6). The new sum in (5.3.7) is a well-known identity for binomial
coefficients and is equal to $\binom{b+r-1}{b-1}$; see §3.9. Thus

$$P(B_{n+1}) = \binom{b+r-1}{b-1} \Big/ \binom{b+r}{b} = \frac{b}{b+r}$$

as asserted in (5.3.3).
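
The "experimental verification" can also be delegated to a machine: the sum in (5.3.6) is short enough to evaluate exactly for any small urn. A minimal sketch, with a function name of our own (`prob_next_black`) and exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

def prob_next_black(b, r, n):
    """P(black at drawing n + 1), conditioning on the number j of black
    balls among the first n drawings, for 0 <= n < b + r."""
    total = Fraction(0)
    for j in range(0, min(b, n) + 1):
        # Hypergeometric weight; comb(r, n - j) is 0 when n - j exceeds r.
        p_yn = Fraction(comb(b, j) * comb(r, n - j), comb(b + r, n))
        # Proportion of black balls left when j black ones are gone.
        p_black = Fraction(b - j, b + r - n)
        total += p_yn * p_black
    return total

b, r = 3, 4
print([prob_next_black(b, r, n) for n in range(b + r)])  # all equal to 3/7
```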


Second Method. This is purely combinatorial and can be worked out as an
example in §3.2. Its merit is simplicity; but it cannot be easily generalized to
apply to the next urn model we shall consider.

Consider the successive outcomes in n + 1 drawings:
$X_1(\omega), X_2(\omega), \ldots, X_n(\omega), X_{n+1}(\omega)$.
Each $X_j(\omega)$ is 1 or 0 depending on the particular ω; even the
numbers of 1’s and 0’s among them depend on ω when $n + 1 < b + r$.
Two different outcome-sequences such as 0011 and 0101 will not have the
same probability in general. But now let us put numerals on the balls, say
1 to b for the black ones and b + 1 to b + r for the red ones so that all balls
become distinguishable. We are then in the case of sampling without
replacement and with ordering discussed in §3.2. The total number of possibilities
with the new labeling is given by (3.2.1) with b + r for m and n + 1 for n:
$(b + r)_{n+1}$. These are now all equally likely! We are interested in the cases
where the (n + 1)st ball is black; how many are there for these? There are b
choices for the (n + 1)st ball, and after this is chosen there are $(b + r - 1)_n$
ways of arranging the first n balls, by another application of (3.2.1). Hence
by the Fundamental Rule in §3.1, the number of cases where the (n + 1)st ball
is black is equal to $b\,(b + r - 1)_n$. Now the classical ratio formula for
probability applies to yield the answer

$$P(B_{n+1}) = \frac{b\,(b + r - 1)_n}{(b + r)_{n+1}} = \frac{b}{b + r}.$$


Undoubtedly this argument is easier to follow after it is explained, and 
there is little computation. But it takes a bit of perception to hit upon the 
counting method. Poisson [1781-1840; French mathematician for whom a 
distribution, a process, a limit theorem and an integral were named, among 
other things] gave this solution but his explanation is more brief than ours. 
We state his general result as follows. 


Theorem 2 [Poisson’s Theorem]. Suppose in an urn containing b black and r
red balls, n balls have been drawn first and discarded without their colors being
noted. If m balls are drawn next, the probability that there are k black balls
among them is the same as if we had drawn these m balls at the outset [without
having discarded the n balls previously drawn].

Briefly stated: The probabilities are not affected by the preliminary
drawing so long as we are in the dark as to what those outcomes are. Obviously if
we know the colors of the balls discarded, the probabilities will be affected
in general. To quote [Keynes, p. 349]: “This is an exceedingly good
example . . . that a probability cannot be influenced by the occurrence of a
material event but only by such knowledge as we may have, respecting the
occurrence of the event.”

Here is Poisson’s quick argument: If n + m balls are drawn out, the
probability of a combination which is made up of n black and red balls in
given proportions followed by m balls of which k are black and m − k are
red, must be the same as that of a similar combination in which the m balls
precede the n balls. Hence the probability of k black balls in m drawings given
that n balls have already been drawn out, must be equal to the probability
of the same result when no balls have been previously drawn out.

Is this totally convincing to you? The more explicit combinatorial argument 
given above for the case m = 1 can be easily generalized to settle any 
doubt. The doubt is quite justified despite the authority of Poisson. As we 
may learn from Chapter 3, in these combinatorial arguments one must do 
one’s own thinking. 
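If you want to do your own checking by machine, here is a minimal sketch in Python with exact fractions. It conditions on the unknown number j of black balls among the n discarded ones and averages over j, then compares the result with the probability obtained when nothing is discarded. The function names `hypergeom` and `after_blind_discard`, and the urn sizes in the last lines, are assumptions made for this illustration only; they are not notation from the text.

```python
from fractions import Fraction
from math import comb

def hypergeom(b, r, m, k):
    """P(k black among m balls drawn without replacement from b black, r red)."""
    if k < 0 or k > b or m - k < 0 or m - k > r:
        return Fraction(0)
    return Fraction(comb(b, k) * comb(r, m - k), comb(b + r, m))

def after_blind_discard(b, r, n, m, k):
    """Same probability after n balls were drawn and discarded unseen.

    Averages over the unknown number j of black balls among the discarded n.
    """
    total = Fraction(0)
    for j in range(n + 1):
        p_j = hypergeom(b, r, n, j)      # j black among the discarded balls
        if p_j == 0:
            continue                     # impossible composition, skip it
        total += p_j * hypergeom(b - j, r - (n - j), m, k)
    return total

# Illustrative urn (hypothetical numbers): 5 black, 3 red, discard 2, draw 3.
for k in range(4):
    print(k, after_blind_discard(5, 3, 2, 3, k), hypergeom(5, 3, 3, k))
```

The two columns printed for each k agree, as Theorem 2 asserts; of course a check on one urn is not a proof.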


5.4.* Pólya’s urn scheme 


To pursue the discussion in the preceding section a step further, we will study 
a famous generalization due to G. Pólya [1887-; professor emeritus at 
Stanford University, one of the most eminent analysts of modern times who 
also made major contributions to probability and combinatorial theories and 
their applications]. As before the urn contains b black and r red balls to begin 
with, but after a ball is drawn each time, it is returned to the urn and c balls 
of the same color are added to the urn, where c is an integer and when c < 0 
adding c balls means subtracting -c balls. This may be done whether we 
observe the color of the ball drawn or not; in the latter case, e.g., we may 
suppose that it is performed by an automaton. If c = 0 this is just sampling 
with replacement, while if c = -1 we are in the situation studied in §5.3. In 
general if c is negative the process has to stop after a number of drawings, 
but if c is zero or positive it can be continued forever. This scheme can be 
further generalized (you know generalization is a mathematician’s bug!) if 
after each drawing we add to the urn not only c balls of the color drawn but 
also d balls of the other color. But we will not consider this, and furthermore 
we will restrict ourselves to the case c ≥ -1, referring to the scheme as 
Pólya’s urn model. This model was actually invented by him to study a problem 
arising in medicine; see the last paragraph of this section. 


Problem. What is the probability that in Pólya’s model the first three balls 
drawn have colors {b, b, r} in this order? or {b, r, b}? or {r, b, b}? 


An easy application of Proposition 1 in §5.2 yields, in the notation intro- 
duced in §5.3: 


$$P(B_1 B_2 R_3) = P(B_1)\,P(B_2 \mid B_1)\,P(R_3 \mid B_1 B_2) = \frac{b}{b+r}\cdot\frac{b+c}{b+r+c}\cdot\frac{r}{b+r+2c}. \tag{5.4.1}$$


Similarly 

$$P(B_1 R_2 B_3) = \frac{b}{b+r}\cdot\frac{r}{b+r+c}\cdot\frac{b+c}{b+r+2c},$$

$$P(R_1 B_2 B_3) = \frac{r}{b+r}\cdot\frac{b}{b+r+c}\cdot\frac{b+c}{b+r+2c}.$$


Thus they are all the same, namely the probability of drawing 2 black and 1 
red balls in three drawings does not depend on the order in which they are 
drawn. It follows that the probability of drawing 2 black and 1 red in the 
first three drawings is equal to three times the number on the right side of 
(5.4.1). 
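The chain of conditional probabilities used above is easy to mechanize. The following Python sketch multiplies the successive ratios for a given color sequence and updates the urn after each drawing. The name `polya_sequence_prob` and the values b = 3, r = 2, c = 2 are hypothetical choices for illustration, and the sequence is assumed short enough that the urn is never emptied when c = -1.

```python
from fractions import Fraction
from itertools import permutations

def polya_sequence_prob(b, r, c, colors):
    """Probability of drawing the color sequence `colors` ('b' or 'r' each)."""
    prob = Fraction(1)
    for color in colors:
        total = b + r                  # balls in the urn before this drawing
        if color == "b":
            prob *= Fraction(b, total)
            b += c                     # return the ball, add c black ones
        else:
            prob *= Fraction(r, total)
            r += c                     # return the ball, add c red ones
    return prob

# The three orders of two blacks and one red, for an illustrative urn.
orders = sorted({"".join(p) for p in permutations("bbr")})
probs = {o: polya_sequence_prob(3, 2, 2, o) for o in orders}
print(probs)
```

For this urn each of the three orders has probability 2/21, so the probability of 2 black and 1 red in three drawings is 3 times that, namely 2/7.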

The general result is given below. Recall the definition of $X_n$ given in §5.3, 
which need not be changed although the scheme has changed. 


Theorem 3. The probability of drawing (from the beginning) any specified 
sequence of k black balls and n - k red balls is equal to 


$$\frac{b(b+c)\cdots(b+(k-1)c)\;r(r+c)\cdots(r+(n-k-1)c)}{(b+r)(b+r+c)(b+r+2c)\cdots(b+r+(n-1)c)} \tag{5.4.2}$$


for all n ≥ 1 if c ≥ 0; and for 0 ≤ n ≤ b + r if c = -1. 


Proof: This is really an easy application of Proposition 1 in §5.2, but in a 
scrambled way. We have shown it above in the case k = 2 and n = 3. If 
you will try a few more cases with say n = 4, k = 2 orn = 5, k = 3, you 
will probably see how it goes in the general case more quickly than it can be 
explained in words. The point is: at the mth drawing, where 1 ≤ m ≤ n, the 
denominator of the corresponding conditional probability in (5.2.1) is 
b + r + (m - 1)c, because a total of (m - 1)c balls have been added to the 
urn by this time, no matter what balls have been drawn. Now at the first time 
when a black ball is drawn, there are b black balls in the urn; at the second 
time a black ball is drawn, the number of black balls in the urn is b + c, 
because one black ball has been previously drawn so c black balls have been 
added to the urn. This is true no matter at what time (which drawing) the 
second black ball is drawn. Similarly when the third black ball is drawn there 
will be b + 2c black balls in the urn, and so on. This explains the k factors 
involving b in the numerator of (5.4.2). Now consider the red balls: at the 
first time a red ball is drawn there are r red ones in the urn; at the second 
time a red ball is drawn, there are r + c red ones in the urn, because c red 
balls have been added after the first red one is drawn, and so on. This explains 
the n — k factors involving r(=red) in the numerator of (5.4.2). The whole 
thing there is therefore obtained by multiplying the successive ratios as the 
conditional probabilities in (5.2.1), and the exact order in which the factors 
in the numerator occur is determined by the specific order of blacks and reds 
in the given sequence. However, their product is the same so long as n and k 
are fixed. This establishes (5.4.2). 

For instance if the specified sequence is RBRRB then the exact order in the 
numerator should be rb(r + c)(r + 2c)(b + c). 

Now suppose that only the numbers of each color are given [specified!] 
but not the exact sequence, then we have the next result. 


Theorem 4. The probability of drawing (from the beginning) k black balls in n 
drawings is equal to the number in (5.4.2) multiplied by $\binom{n}{k}$. In terms of 
generalized binomial coefficients (see (5.4.4) below), it is equal to 


$$\frac{\dbinom{-b/c}{k}\dbinom{-r/c}{n-k}}{\dbinom{-(b+r)/c}{n}}. \tag{5.4.3}$$


Proof: There are $\binom{n}{k}$ ways of permuting k black and n - k red balls; see 
§3.2. According to (5.4.2), every specified sequence of drawing k black and 
n - k red balls has the same probability. These various permutations correspond 
to disjoint events. Hence the probability stated in the theorem is just 
the sum of $\binom{n}{k}$ probabilities each of which is equal to the number given in 
(5.4.2). It remains to express this probability by (5.4.3), which requires only 
a bit of algebra. Let us note that if a is a positive real number and j is a positive 
integer, then by definition 


$$\binom{-a}{j} = \frac{(-a)(-a-1)\cdots(-a-j+1)}{j!} = (-1)^j\,\frac{a(a+1)\cdots(a+j-1)}{j!}. \tag{5.4.4}$$


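Thus if we divide every factor in (5.4.2) by c, and write 


$$\beta = \frac{b}{c}, \qquad \gamma = \frac{r}{c}$$


for simplicity, then use (5.4.4), we obtain 


$$\frac{\beta(\beta+1)\cdots(\beta+k-1)\;\gamma(\gamma+1)\cdots(\gamma+n-k-1)}{(\beta+\gamma)(\beta+\gamma+1)\cdots(\beta+\gamma+n-1)} = \frac{(-1)^k k!\binom{-\beta}{k}\,(-1)^{n-k}(n-k)!\binom{-\gamma}{n-k}}{(-1)^n n!\binom{-\beta-\gamma}{n}} = \frac{\binom{-\beta}{k}\binom{-\gamma}{n-k}}{\binom{-\beta-\gamma}{n}\binom{n}{k}}.$$


After multiplying by $\binom{n}{k}$ we get (5.4.3) as asserted. 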
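The bit of algebra can be double-checked numerically. The sketch below, in Python with exact fractions, computes the probability of k black balls in n drawings in two ways: as $\binom{n}{k}$ times the product in (5.4.2), and by the generalized binomial coefficients of (5.4.3) and (5.4.4). The function names and the sample urn are assumptions for illustration; the second function divides by c and therefore assumes c ≠ 0.

```python
from fractions import Fraction
from math import comb, factorial

def gen_binom(a, j):
    """Generalized binomial coefficient a(a-1)...(a-j+1)/j! for rational a."""
    num = Fraction(1)
    for i in range(j):
        num *= a - i
    return num / factorial(j)

def polya_count_prob(b, r, c, n, k):
    """C(n, k) times the product (5.4.2): k black balls in n drawings."""
    num = Fraction(1)
    for i in range(k):
        num *= b + i * c               # factors involving b
    for i in range(n - k):
        num *= r + i * c               # factors involving r
    den = Fraction(1)
    for i in range(n):
        den *= b + r + i * c           # one denominator per drawing
    return comb(n, k) * num / den

def polya_count_prob_binom(b, r, c, n, k):
    """The same probability written as in (5.4.3); assumes c != 0."""
    beta, gamma = Fraction(b, c), Fraction(r, c)
    return (gen_binom(-beta, k) * gen_binom(-gamma, n - k)
            / gen_binom(-beta - gamma, n))

# Illustrative urn (hypothetical numbers): b = 3, r = 2, c = 2, n = 5.
for k in range(6):
    print(k, polya_count_prob(3, 2, 2, 5, k), polya_count_prob_binom(3, 2, 2, 5, k))
```

With c = -1 the same functions reproduce sampling without replacement: for b = 3, r = 2, n = 2, k = 1 both give 3/5.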
We can now give a far-reaching generalization of Theorems 1 and 2 in 
§5.3. Furthermore the result will fall out of the fundamental formula (5.4.2) 
like a ripe fruit. Only a bit of terminology and notation is in the way. 

Recalling the definition of $X_n$ in §5.3, we can record (5.4.2) as giving the 
joint distribution of the n random variables $\{X_1, X_2, \ldots, X_n\}$. Let us introduce 
the hedge symbol “①” to denote either “0” or “1” and use subscripts 
(indices) to allow arbitrary choice for each subscript, independently of each 
other. On the other hand, two such symbols with the same subscript must of 
course denote the same choice throughout a discussion. For instance, 
$\{①_1, ①_2, ①_3, ①_4\}$ may mean {1, 1, 0, 1} or {0, 1, 0, 1}, but then $\{①_1, ①_3\}$ 
must mean {1, 0} in the first case and {0, 0} in the second. Theorem 3 can be 
stated as follows: if k of the ①’s below are 1’s and n - k of them are 0’s, then 


$$P(X_1 = ①_1, X_2 = ①_2, \ldots, X_n = ①_n) \tag{5.4.5}$$


is given by the expression in (5.4.2). There are altogether $2^n$ possible choices 
for the ①’s in (5.4.5) [why?], and if we visualize all the resulting values corresponding 
to these choices, the set of $2^n$ probabilities determines the joint 
distribution of $\{X_1, X_2, \ldots, X_n\}$. Now suppose $\{n_1, n_2, \ldots, n_s\}$ is a subset 
of {1, 2, ..., n}; the joint distribution of $\{X_{n_1}, \ldots, X_{n_s}\}$ is determined by 


$$P(X_{n_1} = ①_{n_1}, \ldots, X_{n_s} = ①_{n_s}) \tag{5.4.6}$$


when the latter ①’s range over all the $2^s$ possible choices. This is called a 
marginal distribution with reference to that of the larger set $\{X_1, \ldots, X_n\}$. 

We need more notation! Let $\{n'_1, \ldots, n'_t\}$ be the complementary set of 
$\{n_1, \ldots, n_s\}$ with respect to {1, ..., n}, namely those indices left over after 
the latter set has been taken out. Of course t = n - s and the union 
$\{n_1, \ldots, n_s, n'_1, \ldots, n'_t\}$ is just some permutation of {1, ..., n}. Now we can write 
down the following formula expressing a marginal probability by means of 
joint probabilities of a larger set: 


$$P(X_{n_1} = ①_1, \ldots, X_{n_s} = ①_s) = \sum_{①'_1, \ldots, ①'_t} P(X_{n_1} = ①_1, \ldots, X_{n_s} = ①_s;\; X_{n'_1} = ①'_1, \ldots, X_{n'_t} = ①'_t), \tag{5.4.7}$$


where $\{①'_1, \ldots, ①'_t\}$ is another set of hedge symbols and the sum is over 
all the $2^t$ possible choices for them. This formula follows from the obvious 
set relation 


$$\{X_{n_1} = ①_1, \ldots, X_{n_s} = ①_s\} = \bigcup_{①'_1, \ldots, ①'_t} \{X_{n_1} = ①_1, \ldots, X_{n_s} = ①_s;\; X_{n'_1} = ①'_1, \ldots, X_{n'_t} = ①'_t\}$$

and the additivity of P. [Clearly a similar relation holds when the X’s take 
other values than 0 or 1, in which case the ①’s must be replaced by all possible 
values.] 

We now come to the pièce de résistance of this discussion. It will sorely 
test your readiness to digest a general and abstract argument. If you can’t 
swallow it now, you need not be upset but do come back and try it again later. 


Theorem 5. The joint distribution of any s of the random variables 
$\{X_1, X_2, \ldots, X_n, \ldots\}$ is the same. 


As noted above, the sequence of $X_n$’s is infinite if c ≥ 0, whereas n ≤ 
b + r if c = -1. 


Proof: What does the theorem say? Fix s and let $X_{n_1}, \ldots, X_{n_s}$ be any set of 
s random variables chosen from the entire sequence. To discuss its joint distribution, 
we must consider all possible choices of values for these s random 
variables. So we need a notation for an arbitrary choice of that kind, call it 
$①_1, \ldots, ①_s$. Now let us write down 


$$P(X_{n_1} = ①_1, X_{n_2} = ①_2, \ldots, X_{n_s} = ①_s).$$


We must show that this has the same value no matter what $\{n_1, \ldots, n_s\}$ is, 
namely that it has the same value as 


$$P(X_{m_1} = ①_1, X_{m_2} = ①_2, \ldots, X_{m_s} = ①_s)$$


where $\{m_1, \ldots, m_s\}$ is any other subset of size s. The two sets $\{n_1, \ldots, n_s\}$ 
and $\{m_1, \ldots, m_s\}$ may very well be overlapping, such as {1, 3, 4} and 
{3, 2, 1}. Note also that we have never said that the indices must be in increasing 
order! 

Let the maximum of the indices used above be n. As before let t = n - s, 
and 


$$\{n'_1, \ldots, n'_t\} = \{1, \ldots, n\} - \{n_1, \ldots, n_s\},$$

$$\{m'_1, \ldots, m'_t\} = \{1, \ldots, n\} - \{m_1, \ldots, m_s\}.$$

Next, let $①'_1, \ldots, ①'_t$ be an arbitrary choice of t hedge symbols. We claim 
then 

$$P(X_{n_1} = ①_1, \ldots, X_{n_s} = ①_s;\; X_{n'_1} = ①'_1, \ldots, X_{n'_t} = ①'_t) = P(X_{m_1} = ①_1, \ldots, X_{m_s} = ①_s;\; X_{m'_1} = ①'_1, \ldots, X_{m'_t} = ①'_t). \tag{5.4.8}$$

If you can read this symbolism you will see that it is just a consequence of 
(5.4.2)! For both $(n_1, \ldots, n_s, n'_1, \ldots, n'_t)$ and $(m_1, \ldots, m_s, m'_1, \ldots, m'_t)$ are 
permutations of the whole set (1, ..., n), whereas the set of hedge symbols 
$(①_1, \ldots, ①_s, ①'_1, \ldots, ①'_t)$ is the same on both sides of (5.4.8). So the 
equation merely repeats the assertion of Theorem 3 that any two specified 
sequences having the same number of black balls must have the same probability, 
irrespective of the permutations. 

Finally keeping $①_1, \ldots, ①_s$ fixed but letting $①'_1, \ldots, ①'_t$ vary over all 
$2^t$ possible choices, we get $2^t$ equations of the form (5.4.8). Take their sum 
and use (5.4.7) once as written and another time when the n’s are replaced 
by the m’s. We get 


$$P(X_{n_1} = ①_1, \ldots, X_{n_s} = ①_s) = P(X_{m_1} = ①_1, \ldots, X_{m_s} = ①_s),$$


as we set out to show. Q.E.D. 


There is really nothing hard or tricky about this proof. “It’s just the 
notation!”, as some would say. 
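If the notation is in the way, a small computation may help. The Python sketch below builds the joint probabilities of the first n drawings by the chain rule, then obtains a marginal probability by summing over the remaining indices exactly as in (5.4.7). Comparing different index sets illustrates Theorem 5 on one urn. The names `joint_prob` and `marginal` and the sample parameters are hypothetical; the code writes 1 for black and 0 for red.

```python
from fractions import Fraction
from itertools import product

def joint_prob(b, r, c, outcome):
    """P(X_1 = outcome[0], ..., X_n = outcome[n-1]); 1 = black, 0 = red."""
    prob = Fraction(1)
    for x in outcome:
        prob *= Fraction(b if x == 1 else r, b + r)
        if x == 1:
            b += c
        else:
            r += c
    return prob

def marginal(b, r, c, n, indices, values):
    """P(X_i = v for each pair (i, v)), summing joint probabilities as in (5.4.7).

    `indices` are distinct, 1-based, and at most n.
    """
    total = Fraction(0)
    for outcome in product((0, 1), repeat=n):
        if all(outcome[i - 1] == v for i, v in zip(indices, values)):
            total += joint_prob(b, r, c, outcome)
    return total

# Illustrative urn: b = 3, r = 2, c = 1, looking at the first n = 4 drawings.
for vals in product((0, 1), repeat=2):
    print(vals,
          marginal(3, 2, 1, 4, (1, 2), vals),
          marginal(3, 2, 1, 4, (4, 2), vals),
          marginal(3, 2, 1, 4, (3, 1), vals))
```

Each row prints the same number three times. In particular the marginal of a single drawing, such as the fourth, is 3/5 = b/(b + r) for this urn.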

A sequence of random variables $\{X_n;\, n = 1, 2, \ldots\}$ having the property 
given in Theorem 5 is said to be “permutable” or “exchangeable.” It follows 
in particular that all blocks of given length s, such as $X_{s_0+1}, X_{s_0+2}, \ldots, X_{s_0+s}$, 
where $s_0$ is any nonnegative integer (and $s_0 + s \le b + r$ if c = -1), have 
the same distribution. Since the index is usually interpreted as the time parameter, 
the distribution of such a block may be said to be “invariant under a 
time-shift.” A sequence of random variables having this property is said to 
be “[strictly] stationary.” This kind of process is widely used as a model in 
electrical oscillations, economic time series, queuing problems, etc. 

Pólya’s scheme may be considered as a model for a fortuitous happening 
[a “random event” in the everyday usage] whose likelihood tends to increase 
with each occurrence and decrease with each non-occurrence. The drawing 
of a black ball from his urn is such an event. Pólya himself cited as example 
the spread of an epidemic in which each victim produces many more new 
germs and so increases the chances of further contamination. To quote him 
directly (my translation from the French original), “In reducing this fact to 
its simplest terms and adding to it a certain symmetry, propitious for mathematical 
treatment, we are led to the urn scheme.” The added symmetry refers 
to the adding of red balls when a red ball is drawn, which would mean that 
each non-victim also increases the chances of other non-victims. This half 
of the hypothesis for the urn model does not seem to be warranted, and is 
slipped in without comment by several authors who discussed it. Professor 
Pólya’s candor, in admitting it as a mathematical expediency, should be reassuring 
to scientists who invented elaborate mathematical theories to deal 
with crude realities such as hens pecking (mathematical psychology) and 
beetles crawling (mathematical biology). 


5.5. Independence and relevance 


An extreme and extremely important case of conditioning occurs when the 
condition has no effect on the probability. This intuitive notion is common 
experience in tossing a coin or throwing a die several times, or drawing a ball 
several times from an urn with replacement. The knowledge of the outcome 
of the previous trials should not change the “virgin” probabilities of the next 
trial and in this sense the trials are intuitively independent of each other. We 
have already defined independent events in §2.4; observe that the defining 
relations in (2.4.5) are just special cases of (5.2.1) when all conditional probabilities 
are replaced by unconditional ones. The same replacement in (5.2.2) 
will now lead to the fundamental definition below. 


Definition of Independent Random Variables. The countably-valued random 
variables $X_1, \ldots, X_n$ are said to be independent iff for any real numbers 
$x_1, \ldots, x_n$, we have 

$$P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdots P(X_n = x_n). \tag{5.5.1}$$


This equation is trivial if one of the factors on the right is equal to zero, hence 
we may restrict the x’s above to the countable set of all possible values of all 
the X’s. 

The deceptively simple condition (5.5.1) actually contains much more than 
meets the eye. To see this let us deduce at once a major extension of (5.5.1) 
in which single values $x_i$ are replaced by arbitrary sets $S_i$. Let $X_1, \ldots, X_n$ 
be independent random variables in Propositions 4 to 6 below. 


Proposition 4. We have for arbitrary countable sets $S_1, \ldots, S_n$: 

$$P(X_1 \in S_1, \ldots, X_n \in S_n) = P(X_1 \in S_1) \cdots P(X_n \in S_n). \tag{5.5.2}$$


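Proof: The left member of (5.5.2) is equal to 


$$\sum_{x_1 \in S_1} \cdots \sum_{x_n \in S_n} P(X_1 = x_1, \ldots, X_n = x_n) = \sum_{x_1 \in S_1} \cdots \sum_{x_n \in S_n} P(X_1 = x_1) \cdots P(X_n = x_n) = \Big\{ \sum_{x_1 \in S_1} P(X_1 = x_1) \Big\} \cdots \Big\{ \sum_{x_n \in S_n} P(X_n = x_n) \Big\},$$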

which is equal to the right member of (5.5.2) by simple algebra (which you 
should spell out if you have any doubt). 

Note that independence of a set of random variables as defined above is 
a property of the set as a whole. Such a property is not necessarily inherited 
by a subset; can you think of an easy counter-example? However, as a consequence 
of Proposition 4, any subset of $(X_1, \ldots, X_n)$ is indeed also a set of 
independent random variables. To see e.g. $(X_1, X_2, X_3)$ is such a set when 
n > 3 above, we take $S_i = R^1$ for i > 3 and replace the other $S_i$’s by $\{x_i\}$ in 
(5.5.2). 

Next, the condition (5.5.2) will be further strengthened into its most useful 
form. 

Proposition 5. The events 


$$\{X_1 \in S_1\}, \ldots, \{X_n \in S_n\} \tag{5.5.3}$$


are independent. 


Proof: It is important to recall that the definition of independent events requires 
not only the relation (5.5.2), but also similar relations for all subsets 
of $(X_1, \ldots, X_n)$. However, these also hold because the subsets are also sets 
of independent random variables, as just shown. 

Before going further let us check that the notion of independent events 
defined in §2.4 is a special case of independent random variables defined in 
this section. With the arbitrary events $\{A_j, 1 \le j \le n\}$ we associate their 
indicators $I_{A_j}$ (see §1.4), where 


$$I_{A_j}(\omega) = \begin{cases} 1 & \text{if } \omega \in A_j, \\ 0 & \text{if } \omega \in A_j^c; \end{cases} \qquad 1 \le j \le n.$$


These are random variables [at least in a countable sample space]. Each takes 
only the two values 0 or 1, and we have 


$$\{I_{A_j} = 1\} = A_j, \qquad \{I_{A_j} = 0\} = A_j^c.$$


Now if we apply the condition (5.5.1) of independence to the random variables 
$I_{A_1}, \ldots, I_{A_n}$, it reduces exactly to the conditions 


$$P(\tilde{A}_1 \cdots \tilde{A}_n) = P(\tilde{A}_1) \cdots P(\tilde{A}_n), \tag{5.5.4}$$


where each $\tilde{A}_j$ may be $A_j$ or $A_j^c$, but of course must be the same on both sides. 
Now it can be shown (Exercise 36 below) that the condition (5.5.4) for all 
possible choices of $\tilde{A}_j$ is exactly equivalent to the condition (2.4.5). Hence 
the independence of the events $A_1, \ldots, A_n$ is equivalent to the independence 
of their indicators. 

The study of independent random variables will be a central theme in any 
introduction to probability theory. Historically and empirically, they are 
known as independent trials. We have given an informal discussion of this 
concept in §2.4. Now it can be formulated in terms of random variables as 
follows: a sequence of independent trials is just a sequence of independent 
random variables $(X_1, \ldots, X_n)$ where $X_i$ represents the outcome of the ith 
trial. Simple illustrations are given in Examples 7 and 8 of §2.4, where in 
Example 7 the missing random variables are easily supplied. Incidentally, 
these examples establish the existence of independent random variables so 
that we are assured that our theorems such as the propositions in this section 
are not vacuities. Actually we can even construct independent random variables 
with arbitrarily given distributions (see [Chung 1; Chapter 3]). [It 
may amuse you to know that mathematicians have been known to define and 
study objects which later turn out to be non-existent!] This remark will be 
relevant in later chapters; for the moment we shall add one more general 
proposition to broaden the horizon. 


Proposition 6. Let $\varphi_1, \ldots, \varphi_n$ be arbitrary real-valued functions on $(-\infty, \infty)$; 
then the random variables 


$$\varphi_1(X_1), \ldots, \varphi_n(X_n) \tag{5.5.5}$$

are independent. 


Proof: Let us omit the subscripts on X and $\varphi$ and ask the question: for a 
given real number y, what are the values of x such that 


$$\varphi(x) = y \quad \text{and} \quad X = x?$$


The set of such values must be countable since X is countably-valued; call it 
S; of course it depends on y, $\varphi$ and X. Then $\{\varphi(X) = y\}$ means exactly the 
same thing as $\{X \in S\}$. Hence for arbitrary $y_1, \ldots, y_n$, the events 

$$\{\varphi_1(X_1) = y_1\}, \ldots, \{\varphi_n(X_n) = y_n\}$$

are just those in (5.5.3) for certain sets $S_1, \ldots, S_n$ specified above. So Proposition 6 
follows from Proposition 5. 

This proposition will be put to good use in Chapter 6. Actually there is a 
more general result as follows. If we separate the random variables $X_1, \ldots, 
X_n$ into any number of blocks, and take a function of those in each block, 
then the resulting random variables are independent. The proof is not so 
different from the special case given above, and will be omitted. 

As for general random variables, they are defined to be independent iff 
for any real numbers $x_1, \ldots, x_n$, the events 


$$\{X_1 \le x_1\}, \ldots, \{X_n \le x_n\} \tag{5.5.6}$$

are independent. In particular, 

$$P(X_1 \le x_1, \ldots, X_n \le x_n) = P(X_1 \le x_1) \cdots P(X_n \le x_n). \tag{5.5.7}$$


In terms of the joint distribution function F for the random vector $(X_1, \ldots, 
X_n)$ discussed in §4.5, the preceding equations may be written as 


$$F(x_1, \ldots, x_n) = F_1(x_1) \cdots F_n(x_n) \tag{5.5.8}$$


where $F_j$ is the marginal distribution of $X_j$, 1 ≤ j ≤ n. Thus in case of independence 
the marginal distributions determine the joint distribution. 

It can be shown that as a consequence of the definition, events such as 
those in (5.5.3) are also independent, provided that the sets $S_1, \ldots, S_n$ are 
reasonable [Borel]. In particular if there is a joint density function f, then we 
have 


$$P(X_1 \in S_1, \ldots, X_n \in S_n) = \Big\{ \int_{S_1} f_1(u)\,du \Big\} \cdots \Big\{ \int_{S_n} f_n(u)\,du \Big\} = \int_{S_1} \cdots \int_{S_n} f_1(u_1) \cdots f_n(u_n)\,du_1 \cdots du_n$$


where $f_1, \ldots, f_n$ are the marginal densities. But the probability in the first 
member above is also equal to 


$$\int_{S_1} \cdots \int_{S_n} f(u_1, \ldots, u_n)\,du_1 \cdots du_n$$

as in (4.5.6). Comparison of these two expressions yields the equation 


$$f(u_1, \ldots, u_n) = f_1(u_1) \cdots f_n(u_n). \tag{5.5.9}$$


This is the form that (5.5.8) takes in the density case. 

Thus we see that stochastic independence makes it possible to factorize a 
joint probability, distribution or density. In the next chapter we shall see that 
it enables us to factorize mathematical expectation, generating function and 
other transforms. 

Numerous results and applications of independent random variables will 
be given in Chapters 6 and 7. In fact, the main body of classical probability 
theory is concerned with them. So much so that in his epoch-making monograph 
Foundations of the Theory of Probability, Kolmogorov [1903-; leading 
Russian mathematician and one of the founders of modern probability theory] 
said: “Thus one comes to perceive, in the concept of independence, at 
least the first germ of the true nature of problems in probability theory.” Here 
we will content ourselves with two simple examples. 


Example 10. A letter from Pascal to Fermat (dated Wednesday, 29th July, 
1654), contains, among many other mathematical problems, the following 
passage: 

“M. de Méré told me that he had found a fallacy in the theory of numbers, 
for this reason: If one undertakes to get a six with one die, the advantage in 
getting it in 4 throws is as 671 is to 625. If one undertakes to throw 2 sixes 
with two dice, there is a disadvantage in undertaking it in 24 throws. And 
nevertheless 24 is to 36 (which is the number of pairings of the faces of two 
dice) as 4 is to 6 (which is the number of faces of one die). This is what made 
him so indignant and made him say to one and all that the propositions were 
not consistent and Arithmetic was self-contradictory: but you will very easily 
see that what I say is correct, understanding the principles as you do.” 

This famous problem, one of the first recorded in the history of probability 
and which challenged the intellectual giants of the time, can now be solved 
by a beginner. 

To throw a six with one die in 4 throws means to obtain the point “six” 
at least once in 4 trials. Define $X_n$, 1 ≤ n ≤ 4, as follows: 


$$P(X_n = k) = \frac{1}{6}, \qquad k = 1, 2, \ldots, 6,$$

and assume that $X_1, X_2, X_3, X_4$ are independent. Put $A_n = \{X_n = 6\}$; then 
the event in question is $A_1 \cup A_2 \cup A_3 \cup A_4$. It is easier to calculate the probability 
of its complement which is identical to $A_1^c A_2^c A_3^c A_4^c$. The trials are assumed 
to be independent and the dice unbiased. We have as a case of (5.5.4), 


$$P(A_1^c A_2^c A_3^c A_4^c) = P(A_1^c)P(A_2^c)P(A_3^c)P(A_4^c) = \Big(\frac{5}{6}\Big)^4;$$


hence 


$$P(A_1 \cup A_2 \cup A_3 \cup A_4) = 1 - \Big(\frac{5}{6}\Big)^4 = 1 - \frac{625}{1296} = \frac{671}{1296}.$$


This last number is approximately equal to 0.5177. Since 1296 - 671 = 625, 
the “odds” are as 671 to 625 as stated by Pascal. 


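Next consider two dice; let $(X_n', X_n'')$ denote the outcome obtained in the 
nth throw of the pair, and let 


$$B_n = \{X_n' = 6;\; X_n'' = 6\}.$$

Then $P(B_n) = \frac{1}{36}$, so that $P(B_n^c) = \frac{35}{36}$, and 


$$P(B_1^c B_2^c \cdots B_{24}^c) = \Big(\frac{35}{36}\Big)^{24},$$


$$P(B_1 \cup B_2 \cup \cdots \cup B_{24}) = 1 - \Big(\frac{35}{36}\Big)^{24}.$$


This last number is approximately equal to 0.4914, which confirms the 
disadvantage. 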


One must give great credit to de Méré for his sharp observation and long 
experience at gaming tables to discern the narrow inequality 


$$P(A_1 \cup A_2 \cup A_3 \cup A_4) > \frac{1}{2} > P(B_1 \cup B_2 \cup \cdots \cup B_{24}).$$


His arithmetic went wrong because of a fallacious “linear hypothesis.” [According 
to some historians the problem was not originated with de Méré.] 
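The two numbers in this example can be reproduced exactly with rational arithmetic. In the Python sketch below, the helper name `at_least_once` is an assumption for illustration; it computes the probability of at least one success in independent trials by passing to the complement, just as was done above.

```python
from fractions import Fraction

def at_least_once(p_single, trials):
    """P(at least one success in `trials` independent trials)."""
    return 1 - (1 - p_single) ** trials

one_die = at_least_once(Fraction(1, 6), 4)     # a six in 4 throws of one die
two_dice = at_least_once(Fraction(1, 36), 24)  # a double six in 24 throws

print(one_die, float(one_die))
print(float(two_dice))
print(two_dice < Fraction(1, 2) < one_die)
```

The first line shows 671/1296, and the last comparison is the narrow inequality credited to de Méré.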


Example 11. If two points are picked at random from the interval [0, 1], 
what is the probability that the distance between them is less than 1/2? 

Figure 24 


By now you should be able to interpret this kind of cryptogram. It means: 
if X and Y are two independent random variables each of which is uniformly 
distributed in [0, 1], find the probability $P(|X - Y| < \frac{1}{2})$. Under the hypotheses 
the random vector (X, Y) is uniformly distributed over the unit 
square U (see Figure 24); namely for any reasonable subset S of U, we have 


$$P\{(X, Y) \in S\} = \iint_S du\,dv.$$


This is seen from the discussion after (4.6.6); in fact the f(u, v) there is equal 
to $f_1(u)f_2(v)$ by (5.5.9) and both $f_1$ and $f_2$ are equal to one in [0, 1] and zero 
outside. For the present problem S is the set of points (u, v) in U satisfying 
the inequality $|u - v| < \frac{1}{2}$. You can evaluate the double integral above over 
this set if you are good at calculus, but it is a lot easier to do this geometrically 
as follows. Draw the two lines $u - v = \frac{1}{2}$ and $u - v = -\frac{1}{2}$; then S is the area 
bounded by these lines and the sides of the square. The complementary area 
U - S is the union of two triangles each of area $\frac{1}{2}\big(\frac{1}{2}\big)^2 = \frac{1}{8}$. Hence we have 


$$\text{area of } S = 1 - 2 \cdot \frac{1}{8} = \frac{3}{4}$$

and this is the required probability. 
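
The geometric answer can be compared with a simulation. The Python sketch below draws independent pairs of uniform numbers and counts how often they are within 1/2 of each other. The function name, the number of trials and the seed are arbitrary choices for illustration, and the result is only an estimate subject to sampling error.

```python
import random

def estimate_close_pair(trials=200_000, seed=0):
    """Monte Carlo estimate of P(|X - Y| < 1/2) for independent uniforms on [0, 1]."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    hits = sum(abs(rng.random() - rng.random()) < 0.5 for _ in range(trials))
    return hits / trials

exact = 1 - 2 * (0.5 * 0.5 ** 2)       # unit square minus the two corner triangles
print(exact, estimate_close_pair())
```

With this many trials the estimate should fall within about 0.01 of 3/4.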

Example 12. Suppose $X_1, X_2, \ldots, X_n$ are independent random variables 
with distributions $F_1, F_2, \ldots, F_n$ as in (4.5.4). Let 


$$M = \max(X_1, X_2, \ldots, X_n),$$
$$m = \min(X_1, X_2, \ldots, X_n).$$


Find the distribution functions of M and m. 
Using (5.5.7) we have for each x: 


$$F_{\max}(x) = P(M \le x) = P(X_1 \le x;\, X_2 \le x;\, \ldots;\, X_n \le x) = P(X_1 \le x)P(X_2 \le x) \cdots P(X_n \le x) = F_1(x)F_2(x) \cdots F_n(x).$$


In particular if all the F’s are the same, 

$$F_{\max}(x) = F(x)^n.$$


Turning to the minimum, it is convenient to introduce the “tail distribution” 
$G_j$ corresponding to each $F_j$ as follows: 


$$G_j(x) = P\{X_j > x\} = 1 - F_j(x).$$

Then we have, using the analogue of (5.5.3) this time with $S_j = (x_j, \infty)$: 


$$G_{\min}(x) = P(m > x) = P(X_1 > x;\, X_2 > x;\, \ldots;\, X_n > x) = P(X_1 > x)P(X_2 > x) \cdots P(X_n > x) = G_1(x)G_2(x) \cdots G_n(x).$$

Hence 

$$F_{\min}(x) = 1 - G_1(x)G_2(x) \cdots G_n(x).$$


If all the F’s are the same, this becomes 

$$G_{\min}(x) = G(x)^n; \qquad F_{\min}(x) = 1 - G(x)^n.$$

Here is a concrete illustration. Suppose a town depends on 3 reservoirs 
for its water supply, and suppose that its daily draws from them are independent 
and have exponential densities $\lambda_1 e^{-\lambda_1 x}$, $\lambda_2 e^{-\lambda_2 x}$, $\lambda_3 e^{-\lambda_3 x}$ respectively. Suppose 
each reservoir can supply a maximum of N gallons per day to that town. 
What is the probability that on a specified day the town will run out of water? 

Call the draws on that day $X_1, X_2, X_3$; the probability in question is, by 
(4.5.12), 


$$P(X_1 > N;\, X_2 > N;\, X_3 > N) = e^{-\lambda_1 N} e^{-\lambda_2 N} e^{-\lambda_3 N} = e^{-(\lambda_1 + \lambda_2 + \lambda_3)N}.$$
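
The product formulas for the maximum and the minimum translate directly into code. In the Python sketch below, `max_cdf`, `min_cdf` and `exponential_cdf` are hypothetical helper names, and the rates 0.5, 1.0, 1.5 and the capacity N = 2 are made-up values used only to exercise the reservoir formula.

```python
import math

def max_cdf(cdfs, x):
    """F_max(x) = F_1(x) ... F_n(x) for independent X_1, ..., X_n."""
    return math.prod(F(x) for F in cdfs)

def min_cdf(cdfs, x):
    """F_min(x) = 1 - G_1(x) ... G_n(x), where G_j = 1 - F_j."""
    return 1.0 - math.prod(1.0 - F(x) for F in cdfs)

def exponential_cdf(lam):
    """Distribution function of the exponential density lam * exp(-lam * x)."""
    return lambda x: 1.0 - math.exp(-lam * x) if x > 0 else 0.0

rates = (0.5, 1.0, 1.5)                # hypothetical values of lambda_1..lambda_3
cdfs = [exponential_cdf(lam) for lam in rates]
N = 2.0                                # hypothetical daily capacity

# All three draws exceed N exactly when the minimum exceeds N.
p_run_out = 1.0 - min_cdf(cdfs, N)
print(p_run_out, math.exp(-sum(rates) * N))
```

The two printed numbers agree up to rounding, since the tail of the minimum is the product of the three exponential tails.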


* The rest of the section is devoted to a brief study of a logical notion 
which is broader than pairwise independence. This notion is inherent in 
statistical comparison of empirical data, operational evaluation of alternative 
policies, etc. Some writers even base the philosophical foundation of statistics 
on such a qualitative notion. 

An event A is said to be favorable to another event B iff 


$$P(AB) \ge P(A)P(B). \tag{5.5.10}$$


This will be denoted symbolically by A ∥ B. It is thus a binary relation between 
two events which includes pairwise independence as a special case. 
An excellent example is furnished by the divisibility by any two positive integers; 
see §2.4 and Exercise 17 in Chapter 2. 

It is clear from (5.5.10) that the relation ∥ is symmetric; it is also reflexive 
since $P(A) \ge P(A)^2$ for any A. But it is not transitive, namely A ∥ B and B ∥ C 
do not imply A ∥ C. In fact, we will show by an example that even the stronger 
relation of pairwise independence is not transitive. 


Example 13. Consider families with 2 children as in Example 5 of §5.1: Ω = 
{(bb), (bg), (gb), (gg)}. Let such a family be chosen at random and consider 
the three events below: 


A = first child is a boy, 
B = the two children are of different sex, 
C = first child is a girl. 


Then 

AB = {(bg)}, BC = {(gb)}, AC = ∅. 


A trivial computation then shows that P(AB) = P(A)P(B), P(BC) = P(B)P(C) 
but P(AC) = 0 ≠ P(A)P(C). Thus the pairs {A, B} and {B, C} are independent 
but the pair {A, C} is not. 

A slight modification will show that pairwise independence does not imply 
total independence for three events. Let 


D = second child is a boy. 

Then 

AD = {(bb)}, BD = {(gb)}, ABD = ∅; 


and so P(ABD) = 0 ≠ P(A)P(B)P(D) = 1/8. 
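
The trivial computations of this example can be left to a machine by enumerating the four equally likely families. In the Python sketch below the helper names `prob` and `product_rule_holds` are assumptions for illustration; the latter only tests the product relation for the intersection of the events it is given, which is what the example needs.

```python
from fractions import Fraction

OMEGA = ["bb", "bg", "gb", "gg"]       # first letter: first child; equally likely

def prob(event):
    """Classical probability of a subset of OMEGA."""
    return Fraction(len(event), len(OMEGA))

A = {w for w in OMEGA if w[0] == "b"}      # first child is a boy
B = {w for w in OMEGA if w[0] != w[1]}     # children of different sex
C = {w for w in OMEGA if w[0] == "g"}      # first child is a girl
D = {w for w in OMEGA if w[1] == "b"}      # second child is a boy

def product_rule_holds(*events):
    """True iff P(intersection of the events) equals the product of their probabilities."""
    inter = set(OMEGA).intersection(*events)
    expected = Fraction(1)
    for e in events:
        expected *= prob(e)
    return prob(inter) == expected

print(product_rule_holds(A, B), product_rule_holds(B, C), product_rule_holds(A, C))
print(product_rule_holds(A, D), product_rule_holds(B, D), product_rule_holds(A, B, D))
```

The first line shows that independence is not transitive; the second shows three events that satisfy the product rule in pairs (together with A, B from the first line) but not as a triple.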

Not so long ago one could still find textbooks on probability and statistics 
in which total independence was confused with pairwise independence. It is 
easy in hindsight to think of everyday analogues of the counter-examples 
above. For instance, if A is friendly to B, and B is friendly to C, why should 
it follow that A is friendly to C? Again, if every two of three people A, B, C 
get along well, it is not necessarily the case that all three of them do. 

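These commonplace illustrations should tell us something about the use 
and misuse of “intuition.” Pushing a bit further, let us record a few more 
non sequiturs below (“⇏” reads “does not imply”): 


$$\begin{aligned} A \parallel C \text{ and } B \parallel C &\nRightarrow (A \cap B) \parallel C; \\ A \parallel B \text{ and } A \parallel C &\nRightarrow A \parallel (B \cap C); \\ A \parallel C \text{ and } B \parallel C &\nRightarrow (A \cup B) \parallel C; \\ A \parallel B \text{ and } A \parallel C &\nRightarrow A \parallel (B \cup C). \end{aligned} \tag{5.5.11}$$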


You may try some verbal explanations for these; rigorous but artificial examples 
are also very easy to construct; see Exercise 15. 

The great caution needed in making conditional evaluation is no academic 
matter, for much statistical analysis of experimental data depends on a critical 
understanding of the basic principles involved. The following illustration is 
taken from Colin R. Blyth, “On Simpson’s paradox and the sure-thing 
principle,” Journal of American Statistical Association, Vol. 67 (1972) pp. 
364-366. 


Example 14. A doctor has the following data on the effect of a new treatment. 
Because it involved extensive follow-up treatment after discharge, he could 
handle only a few out-of-town patients and had to work mostly with patients 
residing in the city. 


               City-residents           Non city-residents 
             Treated   Untreated       Treated   Untreated 
Alive          1000        50             95       5000 
Dead           9000       950              5       5000 

A = alive 
B = treated 
C = city-residents 


The sample space may be partitioned first according to A and B; then accord- 
ing to A, B and C. The results are shown in the diagrams: 


             B        B^c 
A          1095      5050 
A^c        9005      5950 


The various conditional probabilities, namely the classified proportions are 
as follows: 


$$P(A \mid B) = \frac{1095}{10100} = \text{about } 10\% \qquad\qquad P(A \mid BC) = \frac{1000}{10000}$$

$$P(A \mid B^c) = \frac{5050}{11000} = \text{about } 50\% \qquad\qquad P(A \mid B^c C) = \frac{50}{1000}$$

$$P(A \mid BC^c) = \frac{95}{100}$$

$$P(A \mid B^c C^c) = \frac{5000}{10000}.$$

Thus if the results (a matter of life or death) are judged from the conditional 
probabilities in the left column, the treatment seems to be a disaster since it 
had decreased the chance of survival five times! But now look at the right 
column, for city-residents and non city-residents separately: 


$$P(A \mid BC) = 10\%, \qquad P(A \mid B^c C) = 5\%;$$
$$P(A \mid BC^c) = 95\%, \qquad P(A \mid B^c C^c) = 50\%.$$


In both cases the chance of survival is doubled by the treatment. 

The explanation is this: for some reason (such as air pollution), the C 
patients are much less likely to recover than the $C^c$ patients, and most of those 
treated were C patients. Naturally, a treatment is going to show a poor recovery 
rate when used on the most seriously ill of the patients. 

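The arithmetical puzzle is easily solved by the following explicit formulas:


$$
\begin{aligned}
P(A \mid B) = \frac{P(AB)}{P(B)} &= \frac{P(ABC) + P(ABC^c)}{P(B)} \\
&= \frac{P(ABC)}{P(BC)} \cdot \frac{P(BC)}{P(B)} + \frac{P(ABC^c)}{P(BC^c)} \cdot \frac{P(BC^c)}{P(B)} \\
&= P(A \mid BC) P(C \mid B) + P(A \mid BC^c) P(C^c \mid B) \\
&= \frac{1000}{10000} \cdot \frac{10000}{10100} + \frac{95}{100} \cdot \frac{100}{10100};
\end{aligned}
$$


$$
\begin{aligned}
P(A \mid B^c) &= P(A \mid B^cC) P(C \mid B^c) + P(A \mid B^cC^c) P(C^c \mid B^c) \\
&= \frac{50}{1000} \cdot \frac{1000}{11000} + \frac{5000}{10000} \cdot \frac{10000}{11000}.
\end{aligned}
$$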


It is those “hidden coefficients” P(C | B), P(C^c | B), P(C | B^c), P(C^c | B^c) that
have caused a reverse. A little parable will clarify the arithmetic involved.
Suppose in two families both husbands and wives work. Husband of family 
1 earns more than husband of family 2, wife of family 1 earns more than wife 
of family 2. For a certain good cause [or fun] both husband and wife of 
family 2 contribute half their monthly income; but in family 1 the husband 
contributes only 5% of his income, letting the wife contribute 95% of hers. 
Can you see why the poorer couple give more to the cause [or spend more 
on the vacation]? 

This example should be compared with a simpler analogue in Exercise 11, 
where there is no paradox and intuition is a sure thing. 
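The reversal in Example 14 can be checked by direct counting. The following is a minimal sketch in Python, assuming the four groups have the alive/dead counts implied by the fractions in the formulas above (1000/10000, 95/100, 50/1000, 5000/10000); the names `counts` and `survival` are illustrative only.

```python
from fractions import Fraction

# (alive, dead) counts keyed by (treated?, city-resident?); the values are
# read off the fractions used in the worked formulas of Example 14.
counts = {
    (True, True): (1000, 9000),
    (False, True): (50, 950),
    (True, False): (95, 5),
    (False, False): (5000, 5000),
}

def survival(treated, city=None):
    """P(alive | treated status), optionally also conditioned on residence."""
    cells = [v for (t, c), v in counts.items()
             if t == treated and (city is None or c == city)]
    alive = sum(a for a, _ in cells)
    total = sum(a + d for a, d in cells)
    return Fraction(alive, total)

# Pooled over residence: (P(A | B), P(A | B^c)).
pooled = (survival(True), survival(False))
# Within each residence class: (treated, untreated) survival rates.
by_city = {c: (survival(True, c), survival(False, c)) for c in (True, False)}
```

Comparing `pooled` with `by_city` shows the treated group doing worse overall while doing better within each residence class, which is the effect of the hidden coefficients.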


5.6.* Genetical models 


This section treats an application to genetics. The probabilistic model dis- 
cussed here is among the simplest and most successful in empirical sciences. 
Hereditary characters in diploid organisms such as human beings are 
carried by genes which appear in pairs. In the simplest case each gene of a
pair can assume two forms called alleles: A and a. For instance A may be
“blue-eyed,” and a “brown-eyed” in a human being; or A may be “red blossom”
and a “white blossom” in garden peas, which were the original subject
of experiment by Mendel [1822-1884]. We have then three genotypes:


AA, Aa, aa, 


there being no difference between Aa and aA [nature does not order the pair]. 
In some characters, A may be dominant and a recessive, so that Aa cannot
be distinguished from AA in appearance so far as the character in question 
is concerned; in others Aa may be intermediate such as shades of green for eye 
color or pink for pea blossom. The reproductive cells, called gametes, are 
formed by splitting the gene pairs and have only one gene of each pair. At 
mating each parent therefore transmits one of the genes of the pair to the 
offspring through the gamete. The pure type AA or aa can of course transmit 
only A or a, whereas the mixed type Aa can transmit either A or a but not 
both. Now let us fix a gene pair and suppose that the parental genotypes AA, 
Aa, aa are in the proportions 


u : 2v : w, where u > 0, v > 0, w > 0, u + 2v + w = 1.


[The factor 2 in 2v is introduced to simplify the algebra below.] The total 
pool of these three genotypes is very large and the mating couples are formed 
“at random” from this pool. At each mating, each parent transmits one of 
the pair of genes to the offspring with probability 1/2, independently of each 
other, and independently of all other mating couples. Under these circum- 
stances random mating is said to take place. For example, if peas are well 
mixed in a garden these conditions hold approximately; on the other hand 
if the pea patches are segregated according to blossom colors then the mating 
will not be quite random. 

The stochastic model can be described as follows. Two urns contain a
very large number of coins of three types: with an A on each side, with an
A on one side and an a on the other, and with an a on each side. Their proportions are
as u:2v:w for each urn. One coin is chosen from each urn in such a way
that all coins are equally likely. The two chosen coins are then tossed and the
two uppermost faces determine the genotype of the offspring. What is the
probability that it be AA, Aa or aa? In a more empirical vein and using the
frequency interpretation, we may repeat the process a large number of times 
to get an actual sample of the distribution of the types. Strictly speaking, the 
coins must be replaced each time so that the probability of each type remains 
constant in the repeated trials. 

Let us tabulate the cases in which an offspring of type AA will result from 
the mating. Clearly this is possible only if there are at least two A-genes avail- 
able between the parents. Hence the possibilities are given in the first and 
second columns below. 
                                             Probability of
                          Probability of     producing
   Type of    Type of     mating of          offspring AA       Probability of
   male       female      the couple         from the couple    offspring AA

   AA         AA          u·u = u^2          1                  u^2
   AA         Aa          u·2v = 2uv         1/2                uv
   Aa         AA          2v·u = 2uv         1/2                uv
   Aa         Aa          2v·2v = 4v^2       1/4                v^2


In the third column we give the probability of mating between the two desig- 
nated genotypes in the first two entries of the same row; in the fourth column 
we give the conditional probability for the offspring to be of type AA given 
the parental types; in the fifth column the product of the probabilities in the
third and fourth entries of the same row. By Proposition 2 of §5.2, the total 
probability for the offspring to be of type AA is given by adding the entries 
in the fifth column. Thus 


$$P(\text{offspring is } AA) = u^2 + uv + uv + v^2 = (u + v)^2.$$


From symmetry, replacing u by w, we get 


$$P(\text{offspring is } aa) = (v + w)^2.$$


Finally, we list all cases in which an offspring of type Aa can be produced, 
in a similar tabulation as the preceding one. 


                                             Probability of
                          Probability of     producing
   Type of    Type of     mating of          offspring Aa       Probability of
   male       female      the couple         from the couple    offspring Aa

   AA         Aa          u·2v = 2uv         1/2                uv
   Aa         AA          2v·u = 2uv         1/2                uv
   AA         aa          u·w = uw           1                  uw
   aa         AA          w·u = uw           1                  uw
   Aa         aa          2v·w = 2vw         1/2                vw
   aa         Aa          w·2v = 2vw         1/2                vw
   Aa         Aa          2v·2v = 4v^2       1/2                2v^2
Hence we obtain by adding up the last column:

$$P(\text{offspring is } Aa) = 2(uv + uw + vw + v^2) = 2(u + v)(v + w).$$

Let us put

$$p = u + v, \qquad q = v + w \tag{5.6.1}$$


so that p > 0, q > 0, p + q = 1. Let us also denote by P_n(· · ·) the probability
of the genotypes for offspring of the nth generation. Then the results obtained
above are as follows:


$$P_1(AA) = p^2, \qquad P_1(Aa) = 2pq, \qquad P_1(aa) = q^2. \tag{5.6.2}$$


These give the proportions of the parental genotypes for the second generation.
Hence in order to obtain P_2, we need only substitute p^2 for u, pq for v
and q^2 for w in the preceding formulas. Thus,


$$
\begin{aligned}
P_2(AA) &= (p^2 + pq)^2 = p^2, \\
P_2(Aa) &= 2(p^2 + pq)(pq + q^2) = 2pq, \\
P_2(aa) &= (pq + q^2)^2 = q^2.
\end{aligned}
$$


Lo and behold: P_2 is the same as P_1! Does this mean that P_3 is also the same
as P_2, etc.? This is true, but only after the observation below. We have shown
that P_2 = P_1 for an arbitrary P_0 [in fact, even the nit-picking conditions u > 0,
v > 0, w > 0 may be omitted]. Moving over one generation, therefore, P_3 =
P_2, even though P_1 may not be the same as P_0. The rest is smooth sailing,
and the result is known as the Hardy-Weinberg theorem. (G. H. Hardy [1877-
1947] was a leading English mathematician whose main contributions were
to number theory and classical analysis.)


Theorem. Under random mating for one pair of genes, the distribution of the 
genotypes becomes stationary from the first generation on, no matter what the 
original distribution is. 
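The theorem can be checked numerically by enumerating all matings rather than using the closed formulas. The sketch below is only an illustration in Python; the starting proportions and the names `TRANSMIT_A` and `offspring` are assumptions made for the example, not part of the text.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)
# Probability that a parent of each genotype transmits gene A in a gamete.
TRANSMIT_A = {"AA": Fraction(1), "Aa": HALF, "aa": Fraction(0)}

def offspring(dist):
    """Map parental genotype proportions to offspring proportions
    under random mating, summing over all ordered pairs of parents."""
    out = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    for (gm, pm), (gf, pf) in product(dist.items(), repeat=2):
        a_m, a_f = TRANSMIT_A[gm], TRANSMIT_A[gf]
        out["AA"] += pm * pf * a_m * a_f
        out["aa"] += pm * pf * (1 - a_m) * (1 - a_f)
        out["Aa"] += pm * pf * (a_m * (1 - a_f) + (1 - a_m) * a_f)
    return out

# Arbitrary illustrative values with u + 2v + w = 1.
u, v, w = Fraction(1, 2), Fraction(1, 10), Fraction(3, 10)
gen0 = {"AA": u, "Aa": 2 * v, "aa": w}
gen1 = offspring(gen0)
gen2 = offspring(gen1)
p, q = u + v, v + w
```

Here `gen1` agrees with (5.6.2) and `gen2` equals `gen1`, although `gen1` differs from `gen0`; exact fractions are used so that the equality is not blurred by rounding.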

Let us assign the numerical values 2, 1, 0 to the three types AA, Aa, aa
according to the number of A-genes in the pair; and let us denote by X_n the
random variable which represents the numerical genotype of the nth generation.
Then the theorem says that for n ≥ 1:


$$P(X_n = 2) = p^2, \qquad P(X_n = 1) = 2pq, \qquad P(X_n = 0) = q^2. \tag{5.6.3}$$


The distribution of X_n is stationary in the sense that these probabilities do not
depend on n. Actually it can be shown that the process {X_n, n ≥ 1} is strictly
stationary in the sense described in §5.4, because it is also a Markov chain;
see Exercise 40 below.

The result embodied in (5.6.2) may be reinterpreted by an even simpler 
model than the one discussed above. Instead of gene-pairs we may consider a 
pool of gametes, namely after the splitting of the pairs into individual genes. 
Then the A-genes and a-genes are originally in the proportion 


$$(2u + 2v) : (2v + 2w) = p : q$$


because there are two A-genes in the type AA, etc. Now we can think of these 
gametes as so many little tokens marked A or a in an urn, and assimilate the 
birth of an offspring to the random drawing (with replacement) of two of the 
gametes to form a pair. Then the probabilities of drawing AA, Aa, aa are 
respectively: 


$$p \cdot p = p^2, \qquad p \cdot q + q \cdot p = 2pq, \qquad q \cdot q = q^2.$$


This is the same result as recorded in (5.6.2). 
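The gamete-pool computation is short enough to spell out. This is a sketch in Python under the same notation u : 2v : w; the function names `gamete_pool` and `draw_pair` are hypothetical helpers introduced for the illustration.

```python
from fractions import Fraction

def gamete_pool(u, v, w):
    """Fractions of A-genes and a-genes once the genotypes, present in
    proportions u : 2v : w, are split into individual genes."""
    a_big = 2 * u + 2 * v    # two A-genes per AA, one per Aa (share 2v)
    a_small = 2 * v + 2 * w  # one a-gene per Aa, two per aa
    total = a_big + a_small
    return a_big / total, a_small / total

def draw_pair(p, q):
    """Probabilities of AA, Aa, aa when two gametes are drawn with replacement."""
    return {"AA": p * p, "Aa": p * q + q * p, "aa": q * q}

# Illustrative values only, with u + 2v + w = 1.
p, q = gamete_pool(Fraction(1, 2), Fraction(1, 10), Fraction(3, 10))
pair = draw_pair(p, q)
```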

The new model is not the same as the old one, but it leads to the same 
conclusion. It is tempting to try to identify the two models on hindsight, but 
the only logical way of doing so is to go through both cases as we have done. 
A priori or prima facie, they are not equivalent. Consider, for instance, the 
case of fishes: the females lay billions of eggs first and then the males come 
along and fertilize them with sperm. The partners may never meet. In this 
circumstance the second model fits the picture better, especially if we use two 
urns for eggs and sperm separately. [There are in fact creatures in which
sex is not differentiated and which suit the one-urn model.] Such a model
may be called the spawning model, in contrast to the mating model described
earlier. In more complicated cases where more than one pair of genes is in- 
volved, the two models need not yield the same result. 


Example 15. It is known in human genetics that certain “bad” genes cause
crippling defects or disease. If a is such a gene the genotype aa will not survive
to adulthood. A person of genotype Aa is a carrier but appears normal
because a is a recessive character. Suppose the probability of a carrier among
the general population is p, irrespective of sex. Now if a person has an affected 
brother or sister who died in childhood, then he has a history in the family 
and cannot be treated genetically as a member of the general population. 
The probability of his being a carrier is a conditional one to be computed as 
follows. Both his parents must be carriers, namely of genotype Aa, for other- 
wise they could not have produced a child of genotype aa. Since each gene is 
transmitted with probability 1/2, the probabilities of their child to be AA,
Aa, aa are 1/4, 1/2, 1/4 respectively. Since the person in question has survived he
cannot be aa, and so the probability that he be AA or Aa is given by

$$P(AA \mid AA \cup Aa) = \frac{1}{3}, \qquad P(Aa \mid AA \cup Aa) = \frac{2}{3}.$$


If he marries a woman who is not known to have a history of that kind in the 
family, then she is of genotype AA or Aa with probability 1 − p or p as for
the general population. The probabilities for the genotypes of their children 
are listed below. 


                          Probability
                          of the           Probability of   Probability of   Probability of
   Male       Female      combination      producing AA     producing Aa     producing aa

   AA         AA          (1/3)(1 − p)     1                0                0
   AA         Aa          (1/3)p           1/2              1/2              0
   Aa         AA          (2/3)(1 − p)     1/2              1/2              0
   Aa         Aa          (2/3)p           1/4              1/2              1/4


A simple computation gives the following distribution of the genotypes for 
the offspring: 


$$P_1(AA) = \frac{2}{3} - \frac{p}{3}, \qquad P_1(Aa) = \frac{1}{3} + \frac{p}{6}, \qquad P_1(aa) = \frac{p}{6}.$$


The probability of a surviving child being a carrier is therefore


$$P_1(Aa \mid AA \cup Aa) = \frac{2 + p}{6 - p}.$$

If p is negligible, this is about 1/3. Hence from the surviving child’s point of 
view, his having an affected uncle or aunt is only half as bad a hereditary risk 
as his father’s having an affected sibling. One can now go on computing the 
chances for his children, and so on: exercises galore left to the reader.

In concluding this example which concerns a serious human condition, 
it is proper to stress that the simple mathematical theory should be regarded 
only as a rough approximation since other genetical factors have been ignored 
in the discussion. 
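The “simple computation” in Example 15 can be reproduced by summing over the four parental combinations. The following Python sketch assumes exactly the setup of the example (the man is AA or Aa with probabilities 1/3 and 2/3, the woman is a carrier with probability p); the function names are illustrative.

```python
from fractions import Fraction

def child_distribution(p):
    """Genotype distribution of a child of the man in Example 15 and a
    woman from the general population with carrier probability p."""
    man = {"AA": Fraction(1, 3), "Aa": Fraction(2, 3)}
    woman = {"AA": 1 - p, "Aa": p}
    transmit_a = {"AA": Fraction(1), "Aa": Fraction(1, 2)}
    dist = {"AA": Fraction(0), "Aa": Fraction(0), "aa": Fraction(0)}
    for gm, pm in man.items():
        for gf, pf in woman.items():
            am, af = transmit_a[gm], transmit_a[gf]
            dist["AA"] += pm * pf * am * af
            dist["Aa"] += pm * pf * (am * (1 - af) + (1 - am) * af)
            dist["aa"] += pm * pf * (1 - am) * (1 - af)
    return dist

def carrier_given_survival(p):
    """P(child is Aa | child is AA or Aa)."""
    d = child_distribution(p)
    return d["Aa"] / (d["AA"] + d["Aa"])

p = Fraction(1, 50)  # an arbitrary small carrier probability, for illustration
d = child_distribution(p)
risk = carrier_given_survival(p)
```

With this small p, `risk` is close to 1/3, in line with the remark that the hereditary risk is about half that of the father.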


Exercises 


1. Based on the data given in Example 14 of §5.5, what is the probability 
that (a) a living patient resides in the city? (b) a living treated patient 
lives outside the city? 

2. All the screws in a machine come from the same factory but it is as likely 
to be from Factory A as from Factory B. The percentage of defective 
screws is 5% from A and 1% from B. Two screws are inspected; if the 
first is found to be good what is the probability that the second is also 
good? 

3. There are two kinds of tubes in an electronic gadget. It will cease to
function if and only if one of each kind is defective. The probability that
there is a defective tube of the first kind is .1; the probability that there
is a defective tube of the second kind is .2. It is known that two tubes
are defective; what is the probability that the gadget still works?

4. Given that a throw of three unbiased dice shows different faces, what is 
the probability that (a) at least one is a six; (b) the total is eight? 

5. Consider families with three children and assume that each child is 
equally likely to be a boy or a girl. If such a family is picked at random 
and the eldest child is found to be a boy, what is the probability that 
the other two are girls? The same question if a randomly chosen child 
from the family turns out to be a boy. 

6. Instead of picking a family as in No. 5, suppose now a child is picked 
at random from all children of such families. If he is a boy, what is the 
probability that he has two sisters? 

7. Pick a family as in No. 5, and then pick two children at random from 
this family. If they are found to be both girls, what is the probability 
that they have a brother? 

8. Suppose that the probability that both twins are boys is α, and that both
are girls is β; suppose also that when the twins are of different sexes
the probability of the first born being a girl is 1/2. If the first born of 
twins is a girl, what is the probability that the second is also a girl? 


9. Three marksmen hit the target with probabilities 1/2, 2/3, 3/4 respectively.
They shoot simultaneously and there are two hits. Who missed? Find
the probabilities.

10. On a flight from Urbana to Paris my luggage did not arrive with me. It
had been transferred three times and the probabilities that the transfer
was not done in time were estimated to be 4/10, 2/10, 1/10 respectively in the
order of transfer. What is the probability that the first airline goofed?

11. Prove the “sure-thing principle”: if

$$P(A \mid C) \ge P(B \mid C), \qquad P(A \mid C^c) \ge P(B \mid C^c),$$

then P(A) ≥ P(B).

12. Show that if A ∥ B, then

$$A^c \parallel B^c, \qquad A \nparallel B^c, \qquad A^c \nparallel B.$$

13. Show that if A ∩ B = ∅, then

(i) A ∥ C and B ∥ C ⇒ (A ∪ B) ∥ C;
(ii) C ∥ A and C ∥ B ⇒ C ∥ (A ∪ B);
(iii) A and C are independent, B and C are independent ⇒ A ∪ B and
C are independent.

14. Suppose P(H) > 0. Show that the set function

$$S \to P(S \mid H) \quad \text{for } S \subset \Omega \text{ (countable)}$$

is a probability measure.

15.* Construct examples for all the assertions in (5.5.11). [Hint: a systematic
but tedious way to do this is to assign p_1, ..., p_8 to the eight atoms
ABC, ..., A^cB^cC^c (see (1.3.5)) and express the desired inequalities by
means of them. The labor can be reduced by preliminary simple choices
among the p’s, such as making some of them zero and others equal. One
can also hit upon examples by using various simple properties of a small
set of integers; see an article by the author: “On mutually favorable
events,” Annals of Mathematical Statistics, Vol. 13 (1942), pp. 338-349.]

16. Suppose that A_j, 1 ≤ j ≤ 5, are independent events. Show that

(i) (A_1 ∪ A_2)A_3 and A_4^c ∪ A_5^c are independent;
(ii) A_1 ∪ A_2, A_3 ∩ A_4 and A_5^c are independent.

17. Suppose that in a certain casino there are three kinds of slot machines
in equal numbers with pay-off frequencies ; ; respectively. One 
of these machines paid off twice in four cranks; what is the probability 
of a pay-off on the next crank? 

18. A person takes four tests in succession. The probability of his passing
the first test is p, that of his passing each succeeding test is p or p/2 
according as he passes or fails the preceding one. He qualifies provided 
he passes at least three tests. What is his chance of qualifying? 
19. An electric circuit looks as in the figure where the numbers
indicate the probabilities of failure for the various links, which
are all independent. What is the probability that the circuit is
in operation?

[Figure: electric circuit with the failure probability marked on each link.]

20. It rains half of the time in a certain city, and the weather forecast is
correct 2/3 of the time. Mr. Milquetoast goes out every day and is much
worried about rain. So he will take his umbrella if the forecast is rain,
but he will also take it 1/3 of the time even if the forecast is no rain. Find

(a) the probability of his being caught in rain without an umbrella;
(b) the probability of his carrying an umbrella without rain.

These are the two kinds of errors defined by Neyman and Pearson in
their statistical theory. [Hint: compute the probability of “rain; forecast
no rain; no umbrella,” etc.]

21. Telegraphic signals “dot” and “dash” are sent in the proportion 3:4.
Owing to conditions causing very erratic transmission, a dot becomes
a dash with probability 1/4, whereas a dash becomes a dot with probability
1/3. If a dot is received what is the probability that it is sent as a
dot?

22. A says B told him that C had lied. If each of these persons tells the truth
with probability p, what is the probability that C indeed lied? [Believe
it or not, this kind of question was taken seriously one time under the
name of “credibility of the testimony of witnesses.” In the popular
phrasing given above it is grossly ambiguous, and takes a lot of words
to explain the intended meaning. To cover one case in detail, suppose
all three lied. Then B will tell A that C has told the truth, because B is
supposed to know whether C has lied or not but decides to tell a lie
himself; A will say that B told him that C had lied, since he wants to lie
about what B told him, without knowing what C did. This is just one
of the eight possible cases but the others can be similarly interpreted.
A much clearer formulation is the model of transmission of signals used
in No. 21. C transmits − or + according as he lies or not; then B transmits
the message from C incorrectly or correctly according as he lies or
not; then A transmits the message from B in a similar manner. There
will be no semantic impasse even if we go on this way to any number
of witnesses. The question is: if “−” is received at the end of line, what
is the probability that it is sent as such initially?]

23. A particle starts from the origin and moves on the line 1 unit to the
right or left with probability 1/2 each, the successive movements being
independent. Let Y_n denote its position after n moves. Find the following
probabilities:

(a) P(Y_n ≥ 0 for 1 ≤ n ≤ 4);
(b) P(|Y_n| ≤ 2 for 1 ≤ n ≤ 4);
(c) P(Y_n ≥ 0 for 1 ≤ n ≤ 4 | Y_4 = 0).


24. In No. 23, show that if j < k < n, we have

$$P(Y_n = c \mid Y_j = a, Y_k = b) = P(Y_n = c \mid Y_k = b) = P(Y_{n-k} = c - b)$$

where a, b, c are any integers. Illustrate with j = 4, k = 6, n = 10,
a = 2, b = 4, c = 6.

25. First throw an unbiased die, then throw as many unbiased coins as the
point shown on the die.
(a) What is the probability of obtaining k heads?
(b) If 3 heads are obtained what is the probability that the die showed n?

26. In a nuclear reaction a certain particle may split into 2 or 3 particles,
or not split at all. The probabilities for these possibilities are p_2, p_3 and
p_1. The new particles behave in the same way and independently of each
other as well as of the preceding reaction. Find the distribution of the
total number of particles after two reactions.

27. An unbiased die is thrown n times; let M and m denote the maximum
and minimum points obtained. Find P(m = 2, M = 5). [Hint: begin
with P(m ≥ 2, M ≤ 5).]

28. Let X and Y be independent random variables with the same probability
distribution {p_n, n ≥ 1}. Find P(X ≤ Y) and P(X = Y).


In Problems 29-32, consider two numbers picked at random in [0, 1].

29. If the smaller one is less than 1/4, what is the probability that the larger
one is greater than 3/4?

30.* Given that the smaller one is less than x, find the distribution of the
larger one. [Hint: consider P(min < x, max < y) and the two cases
x ≤ y and x > y.]

31.* The two points picked divide [0, 1] into three segments. What is the
probability that these segments can be used to form a triangle? [Hint:
this is the case if and only if the sum of lengths of any two is greater
than the length of the third segment. Call the points X and Y and treat
the case X < Y first.]

32. Prove that the lengths of the three segments mentioned above have the
same distributions. [Hint: consider the distribution of the smaller
value picked, that of the difference between the two values, and use a
symmetrical argument for the difference between 1 and the larger value.]

33. In Pólya’s urn scheme find:

(a) P(R_3 | B_1R_2);
(c) P(R_3 | R_2);
(d) P(R_1 | R_2R_3);
(e) P(R_1 | R_2);
(f) P(R_1 | R_3).


34. Consider two urns U_i containing r_i red and b_i black balls respectively,
i = 1, 2. A ball is drawn at random from U_1 and put into U_2, then a ball
is drawn at random from U_2 and put into urn U_1. After this what is the
probability of drawing a red ball from U_1? Show that if b_1 = r_1, b_2 = r_2,
then this probability is the same as if no transfers have been made.

35.* Assume that the a priori probabilities p in the sunrise problem (Example
9 of §5.2) can only take the values k/100, 1 ≤ k ≤ 100, with probability
1/100 each. Find P(S^{n+1} | S^n). Replace 100 by N and let N → ∞; what is
the limit?
36.* Prove that the events A_1, ..., A_n are independent if and only if

$$P(\tilde{A}_1 \cdots \tilde{A}_n) = P(\tilde{A}_1) \cdots P(\tilde{A}_n),$$

where each Ã_j may be A_j or A_j^c. [Hint: to deduce these equations from
independence, use induction on n and also induction on k in
P(A_1^c ⋯ A_k^c A_{k+1} ⋯ A_n); the converse is easy by induction on n.]

37.* Spell out a proof of Theorem 2 in §5.3. [Hint: label all balls and show
that any particular sequence of balls has the same probability of occupying
any given positions if all balls are drawn in order.]

38. Verify Theorem 5 of §5.4 directly for the pairs (X_1, X_2), (X_1, X_3) and
(X_2, X_3).

39. Assume that the three genotypes AA, Aa, aa are in the proportion
p^2 : 2pq : q^2, where p + q = 1. If two parents chosen at random from the
population have an offspring of type AA, what is the probability that
another child of theirs is also of type AA? Same question with AA replaced
by Aa.

40. Let X_1 and X_2 denote the genotype of a female parent and her child.
Assuming that the unknown genotype of the male parent is distributed
as in Problem No. 39 and using the notation of (5.6.3), find the nine
conditional probabilities below:

$$P\{X_2 = k \mid X_1 = j\}, \qquad j = 0, 1, 2; \ k = 0, 1, 2.$$

These are called the transition probabilities of a Markov chain; see
§8.3.


41.* Prove that if the function φ defined on [0, ∞) is nonincreasing and satisfies
the Cauchy functional equation

$$\varphi(s + t) = \varphi(s)\varphi(t), \qquad s \ge 0, \ t \ge 0;$$

then φ(t) = e^{−λt} for some λ ≥ 0. Hence a positive random variable T
has the property

$$P(T > s + t \mid T > s) = P(T > t), \qquad s \ge 0, \ t \ge 0$$

if and only if it has an exponential distribution. [Hint: φ(0) = 1;
φ(1/n) = α^{1/n} where α = φ(1), φ(m/n) = α^{m/n}; if m/n ≤ t < (m + 1)/n
then α^{(m+1)/n} ≤ φ(t) ≤ α^{m/n}; hence the general conclusion follows by
letting n → ∞.]

42. A needle of unit length is thrown onto a table which is marked with
parallel lines at a fixed distance d from one another, where d > 1. Let
the distance from the midpoint of the needle to the nearest line be x,
and let the angle between the needle and the perpendicular from its midpoint
to the nearest line be θ. It is assumed that x and θ are independent
random variables, each of which is uniformly distributed over its range.
What is the probability that the needle intersects a line? This is known as
Buffon’s problem and its solution suggests an empirical [Monte Carlo]
method of determining the value of π.


Chapter 6 


Mean, Variance and Transforms 


6.1. Basic properties of expectation 


The mathematical expectation of a random variable, defined in §4.3, is 
one of the foremost notions in probability theory. It will be seen to play the 
same role as integration in calculus—and we know “integral calculus” is at 
least half of all calculus. Recall its meaning as a probabilistically weighted 
average [in a countable sample space] and rewrite (4.3.11) more simply as: 


$$E(X) = \sum_{\omega} X(\omega) P(\omega). \tag{6.1.1}$$


If we substitute |X| for X above, we see that the proviso (4.3.12) may be
written as


$$E(|X|) < \infty. \tag{6.1.2}$$


We shall say that the random variable X is summable when (6.1.2) is satisfied.
In this case we say also that “X has a finite expectation (or mean)” or “its
expectation exists.” The last expression is actually a little vague because we
generally allow E(X) to be defined and equal to +∞ when for instance X ≥ 0
and the series in (6.1.1) diverges. See Exercises 27 and 28 of Chapter 4. We
shall say so explicitly when this is the case.

It is clear that if X is bounded, namely when there exists a number M
such that


$$|X(\omega)| \le M \quad \text{for all } \omega \in \Omega,$$


then X is summable and in fact


$$E(|X|) = \sum_{\omega} |X(\omega)| P(\omega) \le M \sum_{\omega} P(\omega) = M.$$


In particular if Q is finite then every random variable is bounded (this does 
not mean all of them are bounded by the same number). Thus the class of 
random variables having a finite expectation is quite large. For this class the 
mapping 


$$X \to E(X) \tag{6.1.3}$$


assigns a number to a random variable. For instance, if X is the height of
students in a school, then E(X) is their average height; if X is the income of
wage earners, then E(X) is their average income; if X is the number of vehicles
passing through a toll bridge in a day, then E(X) is the average daily
traffic, etc.


If A is an event, then its indicator I_A (see §1.4) is a random variable, and
we have


$$E(I_A) = P(A).$$


In this way the notion of mathematical expectation is seen to extend that of 
a probability measure. 

Recall that if X and Y are random variables, then so is X + Y (Proposition 1
of §4.2). If X and Y both have finite expectations, it is intuitively clear
what the expectation of X + Y should be. Thanks to the intrinsic nature of
our definition, it is easy to prove the theorem.


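Theorem 1. If X and Y are summable, then so is X + Y and we have

$$E(X + Y) = E(X) + E(Y). \tag{6.1.4}$$

Proof: Applying the definition (6.1.1) to X + Y, we have

$$
\begin{aligned}
E(X + Y) &= \sum_{\omega} (X(\omega) + Y(\omega)) P(\omega) \\
&= \sum_{\omega} X(\omega) P(\omega) + \sum_{\omega} Y(\omega) P(\omega) = E(X) + E(Y).
\end{aligned}
$$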


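This is the end of the matter. You may wonder where we need the
condition (6.1.2). The answer is: we want the defining series for E(X + Y)
to converge absolutely, as explained in §4.3. This is indeed the case because


$$
\begin{aligned}
\sum_{\omega} |X(\omega) + Y(\omega)| P(\omega) &\le \sum_{\omega} (|X(\omega)| + |Y(\omega)|) P(\omega) \\
&= \sum_{\omega} |X(\omega)| P(\omega) + \sum_{\omega} |Y(\omega)| P(\omega) < \infty.
\end{aligned}
$$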


Innocuous or ingenuous as Theorem 1 may appear, it embodies the most
fundamental property of E. There is a pair of pale sisters as follows: 


$$E(a) = a, \qquad E(aX) = aE(X) \tag{6.1.5}$$
for any constant a; and combining (6.1.4) and (6.1.5) we obtain
$$E(aX + bY) = aE(X) + bE(Y) \tag{6.1.6}$$


for any two constants a and b. This property makes the operation in (6.1.3) 
a “linear operator.” This is a big name in mathematics; you may have heard
of it in linear algebra or differential equations. 
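Linearity holds whether or not X and Y are independent, and this is easy to confirm on a small sample space. The following Python sketch uses a hypothetical example, two throws of a fair die with X the first throw and Y the larger of the two; the names `omega`, `P` and `E` are introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical sample space: two throws of a fair die, 36 equally likely points.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fraction(1, 36) for w in omega}

def E(f):
    """Expectation as the weighted sum in (6.1.1)."""
    return sum(f(w) * P[w] for w in omega)

X = lambda w: w[0]    # the first throw
Y = lambda w: max(w)  # the larger throw; clearly dependent on X
a, b = 3, -2          # arbitrary constants

lhs = E(lambda w: a * X(w) + b * Y(w))
rhs = a * E(X) + b * E(Y)
```

The two sides agree exactly, even though Y cannot be smaller than X.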

An easy extension of (6.1.4) by mathematical induction yields: if X_1,
X_2, ..., X_n are summable random variables, then so is their sum and

$$E(X_1 + \cdots + X_n) = E(X_1) + \cdots + E(X_n). \tag{6.1.7}$$


Before we take up other properties of E, let us apply this to some interest- 
ing problems. 


Example 1. A raffle lottery contains 100 tickets, of which there is one ticket 
bearing the prize $10000, the rest being all zero. If I buy two tickets, what is 
my expected gain? 

If I have only one ticket, my gain is represented by the random variable 
X which takes the value 10000 on exactly one w and 0 on all the rest. The 
tickets are assumed to be equally likely to win the prize, hence 


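$$X = \begin{cases} 10000 & \text{with probability } \frac{1}{100}, \\[0.5ex] 0 & \text{with probability } \frac{99}{100}, \end{cases}$$

and

$$E(X) = 10000 \cdot \frac{1}{100} + 0 \cdot \frac{99}{100} = 100.$$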


Thus my expected gain is $100. This is trivial, but now if I have two tickets 
I know very well only one of them can possibly win, so there is definite 
interference [dependence] between the two random variables represented by 
the tickets. Will this affect my expectation? Thinking a bit more deeply: if I 
am not the first person to have bought the tickets, perhaps by the time I get 
mine someone else has already taken the prize, albeit unknown to all. Will 
it then make a difference whether I get the tickets early or late? Well, these 
questions have already been answered by the urn model discussed in §5.3. 
We need only assimilate the tickets to 100 balls of which exactly one is black. 
Then if we define the outcome of the nth drawing by X_n, we know from
Theorem 1 there that X_n has the same probability distribution as the X
shown above, and so also the same expectation. For n = 2 this was computed
directly and easily without recourse to the general theorem. It follows that
no matter what j and k are, namely for any two tickets drawn anytime, the
expected value of both together is equal to


$$E(X_j + X_k) = E(X_j) + E(X_k) = 2E(X) = 200.$$


More generally, my expected gain is directly proportional to the number of 
tickets bought—a very fair answer, but is it so obvious in advance? In 
particular if I buy all 100 tickets I stand to win 100 E(X) = 10000 dollars.
This may sound dumb but it checks out. 

To go one step further, let us consider two lotteries of exactly the same 
kind. Instead of buying two tickets X and Y from the same lottery, I may 
choose to buy one from each lottery. Now I have a chance to win $20000. 
Does this make the scheme more advantageous to me? Yet Theorem 1 says
that my expected gain is $200 in either case. How is this accounted for? To 
answer this question you should figure out the distribution of X + Y under 
each scheme and compute E(X + Y) directly from it. You will learn a lot 
by comparing the results. 
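One way to carry out the suggested comparison is by brute-force enumeration. The sketch below, in Python, assumes tickets are labelled 0 to 99 with ticket 0 the winner in each lottery (a labelling chosen purely for convenience); `distribution` and `mean` are hypothetical helper names.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations, product

PRIZE, N = 10000, 100

def gain(ticket):
    # By convention in this sketch, ticket 0 carries the prize.
    return PRIZE if ticket == 0 else 0

def distribution(outcomes):
    """Distribution of the total gain over equally likely ticket pairs."""
    outcomes = list(outcomes)
    dist = defaultdict(Fraction)
    for t1, t2 in outcomes:
        dist[gain(t1) + gain(t2)] += Fraction(1, len(outcomes))
    return dict(dist)

def mean(dist):
    return sum(x * p for x, p in dist.items())

# Two distinct tickets from one lottery.
same = distribution(permutations(range(N), 2))
# One ticket from each of two separate lotteries.
separate = distribution(product(range(N), repeat=2))
```

The two distributions differ (only the second gives positive probability to 20000), yet `mean(same)` and `mean(separate)` coincide, as Theorem 1 requires.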


Example 2. [Coupon collecting problem.] There are N coupons marked 1
to N in a bag. We draw one coupon after another with replacement. Suppose
we wish to collect r different coupons; what is the expected number of
drawings to get them? This is the problem faced by school children who
collect baseball star cards; or by housewives who can win a sewing machine if 
they have a complete set of coupons which come in some detergent boxes. 
In the latter case the coupons may well be stacked against them if certain 
crucial ones are made very scarce. Here of course we consider the fair case 
in which all coupons are equally likely and the successive drawings are 
independent. 

The problem may be regarded as one of waiting time, namely: we wait 
for the rth new arrival. Let X_1, X_2, ... denote the successive waiting times
for a new coupon. Thus X_1 = 1 since the first is always new. Now X_2 is the
waiting time for any coupon that is different from the first one drawn. Since
at each drawing there are N coupons and all but one of them will be new,
this reduces to Example 8 of §4.4 with success probability p = (N − 1)/N; hence


$$E(X_2) = \frac{N}{N-1}.$$


After these two different coupons have been collected, the waiting time for
the third new one is similar with success probability p = (N − 2)/N; hence


$$E(X_3) = \frac{N}{N-2}.$$


Continuing this argument, we obtain for 1 ≤ r ≤ N:

$$E(X_1 + \cdots + X_r) = \frac{N}{N} + \frac{N}{N-1} + \cdots + \frac{N}{N-r+1}.$$


In particular if r = N, then 

(6.1.8)    $E(X_1 + \cdots + X_N) = N\left(1 + \frac{1}{2} + \cdots + \frac{1}{N}\right);$

and if N is even and r = N/2, 

(6.1.9)    $E(X_1 + \cdots + X_{N/2}) = N\left(\frac{1}{\frac{N}{2}+1} + \cdots + \frac{1}{N}\right).$


Now there is a famous formula in mathematical analysis which says that 

(6.1.10)    $1 + \frac{1}{2} + \cdots + \frac{1}{N} = \log N + C + \epsilon_N,$

where the “log” is the natural logarithm to the base e, C is Euler’s 
constant = .5772 ... [nobody in the world knows whether it is a rational 
number or not], and $\epsilon_N$ tends to zero as N goes to infinity. For most purposes, 
the cruder asymptotic formula is sufficient: 

    $\lim_{N \to \infty} \frac{1}{\log N}\left(1 + \frac{1}{2} + \cdots + \frac{1}{N}\right) = 1.$

If we use this in (6.1.8) and (6.1.9), we see that for large values of N, the 
quantities there are roughly equal to N log N and N log 2 = about .69 N 
respectively [how does one get log 2 in the second estimate?]. This means: 
whereas it takes somewhat more drawings than half the number of coupons 
to collect half of them, it takes “infinitely” more drawings to collect all of 
them. The last few items are the hardest to get even if the game is not rigged. 
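The exact sums in (6.1.8) and (6.1.9) are easy to compare with the asymptotic estimates. The following Python sketch does so for N = 100; the function name `expected_draws` and the choice of N are ours.

```python
from fractions import Fraction
from math import log

def expected_draws(N, r):
    """Exact E(X_1 + ... + X_r): the sum of N/(N-k) for k = 0, ..., r-1."""
    return sum(Fraction(N, N - k) for k in range(r))

N = 100
full = float(expected_draws(N, N))        # all N coupons, formula (6.1.8)
half = float(expected_draws(N, N // 2))   # half of them, formula (6.1.9)
crude_full = N * log(N)                   # the estimate N log N
crude_half = N * log(2)                   # the estimate N log 2
```

For N = 100 the exact value of (6.1.8) is about 519, noticeably above N log N, which is about 460; the gap is the term CN of (6.1.10) that the cruder formula drops.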

A terrifying though not so unrealistic application is to the effects of aerial 
bombing in warfare. The results of the strikes are pretty much randomized 
under certain circumstances such as camouflage, decoy, foul weather and 
intense enemy fire. Suppose there are 100 targets to be destroyed but each 
strike hits one of them at random, perhaps repeatedly. Then it takes “on the 
average” about 100 log 100 or about 460 strikes to hit all targets at least once. 
Thus if the enemy has a large number of retaliatory launching sites, it will be 
very hard to knock them all out without accurate military intelligence. The 
conclusion should serve as a mathematical deterrent to the preemptive strike 
theory. 


6.2. The density case 
To return to saner matters, we will extend Theorem 1 to random variables 
in an arbitrary sample space. Actually the result is true in general, provided 
the mathematical expectation is properly defined. An inkling of this may be 
given by writing it as an abstract integral as follows: 

(6.2.1)    $E(X) = \int_{\Omega} X(\omega)\,d\omega,$

where “$d\omega$” denotes the probability of the “element at $\omega$,” as is commonly 
done for an area or volume element in multi-dimensional calculus, the 
so-called “differential.” In this form (6.1.4) becomes 

(6.2.2)    $\int_{\Omega} (X(\omega) + Y(\omega))\,d\omega = \int_{\Omega} X(\omega)\,d\omega + \int_{\Omega} Y(\omega)\,d\omega,$


which is in complete analogy with the familiar formula in calculus: 


    $\int_I (f(x) + g(x))\,dx = \int_I f(x)\,dx + \int_I g(x)\,dx,$


where I is an interval, say [0, 1]. Do you remember anything of the proof of 
the last equation? It is established by going back to the definition of [Rie- 
mann] integrals through approximation by [Riemann] sums. For the 
probabilistic integral in (6.2.1) a similar procedure is followed. It is defined to 
be the limit of mathematical expectations of approximate discrete random 
variables [alluded to in §4.5]. These latter expectations are given by (6.1.1) 
and Theorem 1 is applicable to them. The general result (6.2.2) then follows 
by passing to the limit. 

We cannot spell out the details of this proper approach in this text be- 
cause it requires some measure theory, but there is a somewhat sneaky way 
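to get Theorem 1 in the case when (X, Y) has a joint density as discussed in 
§4.6. Using the notation there, in particular (4.6.7), we have 

(6.2.3)    $E(X) = \int_{-\infty}^{\infty} u f(u, \ast)\,du, \quad E(Y) = \int_{-\infty}^{\infty} v f(\ast, v)\,dv.$

On the other hand, if we substitute $\varphi(x, y) = x + y$ in (4.6.8), we have 

(6.2.4)    $E(X + Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (u + v) f(u, v)\,du\,dv.$

Now this double integral can be split and evaluated by iterated integration: 

    $\int_{-\infty}^{\infty} u\,du\left[\int_{-\infty}^{\infty} f(u, v)\,dv\right] + \int_{-\infty}^{\infty} v\,dv\left[\int_{-\infty}^{\infty} f(u, v)\,du\right]$
    $= \int_{-\infty}^{\infty} u f(u, \ast)\,du + \int_{-\infty}^{\infty} v f(\ast, v)\,dv.$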


Comparison with (6.2.3) establishes (6.1.4). 

The key to this method is the formula (6.2.4) whose proof was not given. 
The usual demonstration runs like this. “Now look here: if X takes the value 
u and Y takes the value v, then X + Y takes the value u + v, and the probability 
that X = u and Y = v is f(u, v) du dv. See?” This kind of talk must 
be qualified as hand-waving or brow-beating. But it is a fact that applied 
scientists find such “demonstrations” quite convincing and one should go 
along until a second look becomes necessary, if ever. For the present the 
reader is advised to work out Exercise 40 below, which is the discrete analogue 
of the density argument above and is perfectly rigorous. These methods will 
be used again in the next section. 


Example 3. Recall Example 5 in §4.2. The $S_n$'s being the successive times 
when the claims arrive, let us put 

    $S_1 = T_1, \quad S_2 - S_1 = T_2, \quad \ldots, \quad S_n - S_{n-1} = T_n, \quad \ldots.$

Thus the $T_n$'s are the inter-arrival times. They are significant not only for our 
example of insurance claims, but also for various other models such as the 
“idle-periods” for sporadically operated machinery, or “gaps” in a traffic 
pattern when the T's measure distance instead of time. In many applications 
it is these T's that are subject to statistical analysis. In the simplest case we 
may assume them to be exponentially distributed as in Example 12 of §4.5. 
If the density is $\lambda e^{-\lambda t}$ for all $T_n$, then $E(T_n) = 1/\lambda$. Since 

    $S_n = T_1 + \cdots + T_n,$

we have by Theorem 1 in the density case: 

    $E(S_n) = E(T_1) + \cdots + E(T_n) = \frac{n}{\lambda}.$

Observe that there is no assumption about the independence of the T's, so 
that mutual influence between them is allowed. For example, several accidents 
may be due to the same cause such as a 20-car smash-up on the freeway. 
Furthermore, the T's may have different $\lambda$'s due e.g., to diurnal or seasonal 
changes. If $E(T_n) = 1/\lambda_n$, then 

    $E(S_n) = \frac{1}{\lambda_1} + \cdots + \frac{1}{\lambda_n}.$


We conclude this section by solving a problem left over from §1.4 and 
Exercise 20 in Chapter 1; cf., also Problem 6 in §3.4. 


Poincaré’s formula. For arbitrary events $A_1, \ldots, A_n$ we have 

(6.2.5)    $P\left(\bigcup_{j=1}^{n} A_j\right) = \sum_{j} P(A_j) - \sum_{j,k} P(A_j A_k) + \sum_{j,k,l} P(A_j A_k A_l) - + \cdots + (-1)^{n-1} P(A_1 \cdots A_n),$

where the indices in each sum are distinct and range from 1 to n. 

Proof: Let $\alpha_j = I_{A_j}$ be the indicator of $A_j$. Then the indicator of $A_1^c \cdots A_n^c$ 
is $\prod_{j=1}^{n} (1 - \alpha_j)$, hence that of its complement is given by 

    $I_{A_1 \cup \cdots \cup A_n} = 1 - \prod_{j=1}^{n} (1 - \alpha_j) = \sum_{j} \alpha_j - \sum_{j,k} \alpha_j \alpha_k + \sum_{j,k,l} \alpha_j \alpha_k \alpha_l - + \cdots + (-1)^{n-1} \alpha_1 \cdots \alpha_n.$
Jit, 


Now the expectation of an indicator random variable is just the probability 
of the corresponding event: 


    $E(I_A) = P(A).$


If we take the expectation of every term in the expansion above, and use 
(6.1.7) on the sums and differences, we obtain (6.2.5). [Henri Poincaré 
[1854-1912] was called the last universalist of mathematicians; his contributions 
to probability theory are largely philosophical and pedagogical. The 
formula above is a version of the “inclusion-exclusion principle” attributed 
to Sylvester.] 
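Formula (6.2.5) can be tried out on a small finite sample space. The sketch below is ours: the die example and the name `union_probability` are hypothetical, and each set of distinct indices is counted once in every sum, which is the reading of (6.2.5) used in the matching problem.

```python
from fractions import Fraction
from itertools import combinations

def union_probability(events, prob):
    """P(A_1 U ... U A_n) by the alternating sum over sets of distinct indices."""
    total = Fraction(0)
    for size in range(1, len(events) + 1):
        sign = (-1) ** (size - 1)
        for subset in combinations(events, size):
            common = set.intersection(*subset)       # the event A_j A_k ...
            total += sign * sum(prob[w] for w in common)
    return total

# Hypothetical example: one roll of a fair die.
prob = {w: Fraction(1, 6) for w in range(1, 7)}
events = [{2, 4, 6}, {3, 6}, {4, 5, 6}]   # even, multiple of 3, at least 4
direct = sum(prob[w] for w in set.union(*events))
```

Here `direct` adds up the probabilities of the points in the union, while `union_probability` never forms the union at all; the two agree.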


Example 4 [Matching Problem or problem of rencontre]. Two sets of cards 
both numbered 1 to n are randomly matched. What is the probability of at 
least one match? 


Solution. Let $A_j$ be the event that the jth cards are matched, regardless of 
the others. There are n! permutations of the second set against the first set, 
which may be considered as laid out in natural order. If the jth cards match, 
that leaves (n − 1)! permutations for the remaining cards, hence 

(6.2.6)    $P(A_j) = \frac{(n-1)!}{n!} = \frac{1}{n}.$

Similarly if the jth and kth cards are both matched, where $j \ne k$, that leaves 
(n − 2)! permutations for the remaining cards, hence 

(6.2.7)    $P(A_j A_k) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)};$

next if j, k, l are all distinct, then 

    $P(A_j A_k A_l) = \frac{(n-3)!}{n!} = \frac{1}{n(n-1)(n-2)},$

and so on. Now there are $\binom{n}{1}$ terms in the first sum on the right side of 
(6.2.5), $\binom{n}{2}$ terms in the second, $\binom{n}{3}$ in the third, etc. Hence altogether the 
right side is equal to 

    $\binom{n}{1}\frac{1}{n} - \binom{n}{2}\frac{1}{n(n-1)} + \binom{n}{3}\frac{1}{n(n-1)(n-2)} - + \cdots + (-1)^{n-1}\binom{n}{n}\frac{1}{n!}$
    $= 1 - \frac{1}{2!} + \frac{1}{3!} - + \cdots + (-1)^{n-1}\frac{1}{n!}.$

Everybody knows (?) that 

    $1 - e^{-1} = 1 - \frac{1}{2!} + \frac{1}{3!} - + \cdots + (-1)^{n-1}\frac{1}{n!} + \cdots = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n!}.$

This series converges very rapidly, in fact it is easy to see that 

    $\left|\,1 - e^{-1} - \left(1 - \frac{1}{2!} + \frac{1}{3!} - + \cdots + (-1)^{n-1}\frac{1}{n!}\right)\right| \le \frac{1}{(n+1)!}.$

Hence as soon as $n \ge 4$, $\frac{1}{(n+1)!} \le \frac{1}{5!} = \frac{1}{120} < \frac{1}{100}$, and the probability of at 
least one match differs from $1 - e^{-1} \approx .63$ by less than .01. In other words, 
the probability of no match is about .37 for all $n \ge 4$. 
What about the expected number of matches? The random number of 
matches is given by 


(6.2.8)    $N = I_{A_1} + \cdots + I_{A_n}.$


[Why? Think this one through thoroughly and remember that the A's are 
neither disjoint nor independent.] Hence its expectation is, by another 
application of Theorem 1: 

    $E(N) = \sum_{j=1}^{n} E(I_{A_j}) = \sum_{j=1}^{n} P(A_j) = n \cdot \frac{1}{n} = 1,$

namely exactly 1 for all n. This is neat, but must be considered as a numerical 
accident. 
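For small n the matching problem can be settled by brute force, listing all n! permutations. The following Python sketch is only a check on the formulas of Example 4; the name `match_stats` is ours.

```python
from fractions import Fraction
from itertools import permutations
from math import exp, factorial

def match_stats(n):
    """Return (P(at least one match), E(N)) by listing all n! permutations."""
    at_least_one = 0
    total_matches = 0
    for perm in permutations(range(n)):
        # a match occurs at position j when the jth card of the second set is j
        matches = sum(1 for j, card in enumerate(perm) if j == card)
        total_matches += matches
        at_least_one += matches > 0
    return (Fraction(at_least_one, factorial(n)),
            Fraction(total_matches, factorial(n)))
```

For instance `match_stats(4)` gives the probability 5/8 of at least one match, and the second component is exactly 1 for every n tried, in agreement with (6.2.8) and Theorem 1.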


6.3. Multiplication theorem; variance and covariance. 


We have indicated that Theorem 1 is really a general result in integration 
theory which is widely used in many branches of mathematics. In contrast, 
the next result requires stochastic independence and is special to probability 
theory. 


Theorem 2. If X and Y are independent summable random variables, then 

(6.3.1)    E(XY) = E(X)E(Y). 

Proof: We will prove this first when $\Omega$ is countable. Then both X and Y 
have a countable range. Let $\{x_j\}$ denote all the distinct values taken by X, 
similarly $\{y_k\}$ for Y. Next, let 

    $A_{jk} = \{\omega \mid X(\omega) = x_j,\ Y(\omega) = y_k\},$

namely $A_{jk}$ is the sample subset for which $X = x_j$ and $Y = y_k$. Then the sets 
$A_{jk}$, as (j, k) range over all pairs of indices, are disjoint [why?] and their 
union is the whole space: 

    $\Omega = \bigcup_{j} \bigcup_{k} A_{jk}.$

The random variable XY takes the value $x_j y_k$ on the set $A_{jk}$, but some of 
these values may be equal, e.g., for $x_j = 2$, $y_k = 3$ and $x_j = 3$, $y_k = 2$. 
Nevertheless we get the expectation of XY by multiplying the probability of 
each $A_{jk}$ with its value on the set, as follows: 

(6.3.2)    $E(XY) = \sum_{j} \sum_{k} x_j y_k P(A_{jk}).$

This is a case of (4.3.15) and amounts merely to a grouping of terms in the 
defining series $\sum_{\omega} X(\omega) Y(\omega) P(\omega)$. Now by the assumption of independence, 

    $P(A_{jk}) = P(X = x_j) P(Y = y_k).$

Substituting this into (6.3.2), we see that the double sum splits into simple 
sums as follows: 

    $\sum_{j} \sum_{k} x_j y_k P(X = x_j) P(Y = y_k) = \left\{\sum_{j} x_j P(X = x_j)\right\}\left\{\sum_{k} y_k P(Y = y_k)\right\} = E(X)E(Y).$

Here again the reassembling is justified by absolute convergence of the 
double series in (6.3.2). 


Next, we prove the theorem when (X, Y) has a joint density function f, 
by a method similar to that used in §6.2. Analogously to (6.2.4), we have 

    $E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} uv f(u, v)\,du\,dv.$

Since we have by (5.5.9), using the notation of §4.6: 

    $f(u, v) = f(u, \ast) f(\ast, v),$

the double integral can be split as follows: 

    $\int_{-\infty}^{\infty} u f(u, \ast)\,du \int_{-\infty}^{\infty} v f(\ast, v)\,dv = E(X)E(Y).$

Strictly speaking, we should have applied the calculations first to |X| and 
|Y|. These are also independent by Proposition 6 of §5.5, and we get 

    $E(|XY|) = E(|X|)E(|Y|) < \infty.$


Hence XY is summable and the manipulations above on the double series 
and double integral are valid. [These fussy details often distract from the 
main argument but are a necessary price to pay for mathematical rigor. 
The instructor as well as the reader is free to overlook some of these at his 
own discretion.] 

The extension to any finite number of independent summable random 
variables is immediate: 


(6.3.3)    $E(X_1 \cdots X_n) = E(X_1) \cdots E(X_n).$


This can be done directly or by induction. In the latter case we need that 
$X_1 X_2$ is independent of $X_3$, etc. This is true and was mentioned in §5.5, 
another fussy detail. 

In the particular case of Theorem 2 where each $X_j$ is the indicator of an 
event $A_j$, (6.3.3) reduces to 

    $P(A_1 \cdots A_n) = P(A_1) \cdots P(A_n).$

This makes it crystal clear that Theorem 2 cannot hold without restriction 
on the dependence. Contrast this with the corresponding case of (6.1.7): 

    $E(I_{A_1} + \cdots + I_{A_n}) = P(A_1) + \cdots + P(A_n).$

Here there is no condition whatever on the events such as their being 
disjoint, and the left member is to be emphatically distinguished from 
$P(A_1 \cup \cdots \cup A_n)$ or any other probability. This is the kind of confusion 
which has pestered pioneers as well as beginners. It is known as Cardano’s 
paradox. [Cardano (1501-1576) wrote the earliest book on games of chance.] 
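The role of independence in Theorem 2 shows up already with a pair of dice. In the Python sketch below (our own illustration, with hypothetical names) the gap E(XY) − E(X)E(Y) is computed for two independent fair dice and for the extreme dependent case Y = X.

```python
from fractions import Fraction
from itertools import product

def expectation(f, joint):
    """E f(X, Y) for a joint distribution given as {(x, y): probability}."""
    return sum(f(x, y) * p for (x, y), p in joint.items())

faces = range(1, 7)
# Independent case: two fair dice thrown separately.
independent = {(x, y): Fraction(1, 36) for x, y in product(faces, faces)}
# Dependent case: Y is forced to equal X.
dependent = {(x, x): Fraction(1, 6) for x in faces}

def product_gap(joint):
    """E(XY) - E(X)E(Y); this is zero whenever Theorem 2 applies."""
    exy = expectation(lambda x, y: x * y, joint)
    ex = expectation(lambda x, y: x, joint)
    ey = expectation(lambda x, y: y, joint)
    return exy - ex * ey
```

The gap vanishes for `independent` but not for `dependent`, whereas E(X + Y) = E(X) + E(Y) holds in both cases.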


Example 5. Iron bars in the shape of slim cylinders are test-measured. 
Suppose the average length is 10 inches and average area of ends is 1 square 
inch. The average error made in the measurement of the length is .005 inch, 
that in the measurement of the area is .01 square inch. What is the average 
error made in estimating their weights? 

Since weight is a constant times volume it is sufficient to consider the 
latter: V = LA where L = length, A = area of ends. Let the errors be $\Delta L$ 
and $\Delta A$ respectively; then the error in V is given by 

    $\Delta V = (L + \Delta L)(A + \Delta A) - LA = L\,\Delta A + A\,\Delta L + \Delta L\,\Delta A.$


Assuming independence between the measurements, we have 

    $E(\Delta V) = E(L)E(\Delta A) + E(A)E(\Delta L) + E(\Delta A)E(\Delta L)$
    $= 10 \cdot \frac{1}{100} + 1 \cdot \frac{1}{200} + \frac{1}{100} \cdot \frac{1}{200}$
    $= .105$ cubic inch if the last term is ignored. 
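The arithmetic of Example 5, including the size of the term that is ignored, can be checked exactly with fractions. This is only a sketch; the variable names are ours.

```python
from fractions import Fraction

mean_L, mean_A = 10, 1                                # inches, square inches
err_L, err_A = Fraction(5, 1000), Fraction(1, 100)    # average errors .005 and .01

first_order = mean_L * err_A + mean_A * err_L   # E(L)E(dA) + E(A)E(dL)
cross_term = err_A * err_L                      # E(dA)E(dL), the ignored term
```

The ignored cross term is smaller than the retained part by a factor of about two thousand.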


Definition of Moment. For positive integer r, the mathematical expectation 
$E(X^r)$ is called the rth moment [moment of order r] of X. Thus if $X^r$ has a 
finite expectation, we say that X has a finite rth moment. For r = 1 of 
course the first moment is just the expectation or mean. 

The case r = 2 is of special importance. Since $X^2 \ge 0$, we shall call $E(X^2)$ 
the second moment of X whether it is finite or equal to $+\infty$ according as the 
defining series [in a countable $\Omega$] $\sum_{\omega} X^2(\omega) P(\omega)$ converges or diverges. 

When the mean E(X) is finite, it is often convenient to consider 

(6.3.4)    $X^0 = X - E(X)$

instead of X because its first moment is equal to zero. We shall say $X^0$ is 
obtained from X by centering. 


Definition of Variance and Standard Deviation. The second moment of $X^0$ 
is called the variance of X and denoted by $\sigma^2(X)$; its positive square root $\sigma(X)$ 
is called the standard deviation of X. 


There is an important relation between E(X), $E(X^2)$ and $\sigma^2(X)$, as follows. 


Theorem 3. If $E(X^2)$ is finite, then so is E(|X|). We have then 

(6.3.5)    $\sigma^2(X) = E(X^2) - E(X)^2;$

consequently 

(6.3.6)    $E(|X|)^2 \le E(X^2).$


Proof: Since 

    $X^2 - 2|X| + 1 = (|X| - 1)^2 \ge 0,$

we must have (why?) $E(X^2 - 2|X| + 1) \ge 0$, and therefore $E(X^2) + 1 \ge 2E(|X|)$ 
by Theorem 1 and (6.1.5). This proves the first assertion of the 
theorem. Next we have 

    $\sigma^2(X) = E\{(X - E(X))^2\} = E\{X^2 - 2E(X)X + E(X)^2\}$
    $= E(X^2) - 2E(X)E(X) + E(X)^2 = E(X^2) - E(X)^2.$

Since $\sigma^2(X) \ge 0$ from the first equation above, we obtain (6.3.6) by 
substituting |X| for X in (6.3.5). 
What is the meaning of $\sigma(X)$? To begin with, $X^0$ is the deviation of X 
from its mean, and can take both positive and negative values. If we are only 
interested in its magnitude then the mean absolute deviation is $E(|X^0|) = E(|X - E(X)|)$. 
This can actually be used but it is difficult for calculations. 
So we consider instead the mean square deviation $E(|X^0|^2)$ which is the 
variance. But then we should cancel out the squaring by extracting the root 
afterward, which gives us the standard deviation $+\sqrt{E(|X^0|^2)}$. This then is 
a gauge of the average deviation of a random variable [sample value] from 
its mean. The smaller it is, the better the random values cluster around its 
average and the population is well centered or concentrated. The true 
significance will be seen later when we discuss the convergence to a normal 
distribution in Chapter 7. 

Observe that X and X + c have the same variance for any constant c; in 
particular, this is the case for X and $X^0$. The next result resembles Theorem 1, 
but only in appearance. 


Theorem 4. If X and Y are independent and both have finite variances, then 

(6.3.7)    $\sigma^2(X + Y) = \sigma^2(X) + \sigma^2(Y).$


Proof: By the preceding remark, we may suppose that X and Y both have 
mean zero. Then X + Y also has mean zero and the variances in (6.3.7) are 
the same as second moments. Now 

    E(XY) = E(X)E(Y) = 0 

by Theorem 2, and 

    $E\{(X + Y)^2\} = E\{X^2 + 2XY + Y^2\}$
    $= E(X^2) + 2E(XY) + E(Y^2) = E(X^2) + E(Y^2)$

by Theorem 1, and this is the desired result. 
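Theorem 4 can be verified on any pair of independent discrete random variables by building the distribution of the sum and computing its variance from (6.3.5). The sketch below uses a fair die and a biased coin; both choices, and the function names, are ours.

```python
from fractions import Fraction
from itertools import product

def variance(dist):
    """sigma^2 = E(X^2) - E(X)^2 for a distribution {value: probability}."""
    m1 = sum(x * p for x, p in dist.items())
    m2 = sum(x * x * p for x, p in dist.items())
    return m2 - m1 ** 2

def sum_distribution(dist_x, dist_y):
    """Distribution of X + Y when X and Y are independent."""
    out = {}
    for (x, p), (y, q) in product(dist_x.items(), dist_y.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}
coin = {0: Fraction(1, 3), 1: Fraction(2, 3)}   # hypothetical biased coin
```

Note that `sum_distribution` multiplies the probabilities p and q, which is precisely where independence is used.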

The extension of Theorem 4 to any finite number of independent random 
variables is immediate. However, there is a general formula for the second 
moment without the assumption of independence, which is often useful. We 
begin with the algebraic identity: 


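    $(X_1 + \cdots + X_n)^2 = \sum_{j=1}^{n} X_j^2 + 2\sum_{1 \le j < k \le n} X_j X_k.$


Taking expectations of both sides and using Theorem 1, we obtain 

(6.3.8)    $E\{(X_1 + \cdots + X_n)^2\} = \sum_{j=1}^{n} E(X_j^2) + 2\sum_{1 \le j < k \le n} E(X_j X_k).$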


When the X’s are centered and assumed to be independent, then all the 
mixed terms in the second sum above vanish and the result is the extension 
of Theorem 4 already mentioned. 

Let us introduce two real indeterminates [dummy variables] $\xi$ and $\eta$ and 
consider the identity: 

    $E\{(X\xi + Y\eta)^2\} = E(X^2)\xi^2 + 2E(XY)\xi\eta + E(Y^2)\eta^2.$

The right member is a quadratic form in $(\xi, \eta)$ and it is never negative because 
the left member is the expectation of a random variable which does not take 
negative values. A well-known result in college algebra says that the 
coefficients of such a quadratic form $a\xi^2 + 2b\xi\eta + c\eta^2$ must satisfy the inequality 
$b^2 \le ac$. Hence in the present case 

(6.3.9)    $E(XY)^2 \le E(X^2)E(Y^2).$


This is called the Cauchy-Schwarz inequality. 
If X and Y both have finite variances, then the quantity 


    $E(X^0 Y^0) = E\{(X - E(X))(Y - E(Y))\}$
    $= E\{XY - XE(Y) - YE(X) + E(X)E(Y)\}$
    $= E(XY) - E(X)E(Y) - E(Y)E(X) + E(X)E(Y)$
    $= E(XY) - E(X)E(Y)$

is called the covariance of X and Y and denoted by Cov (X, Y); the quantity 

    $\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma(X)\sigma(Y)}$

is called the coefficient of correlation between X and Y, provided of course 
the denominator does not vanish. If $\rho(X, Y)$ is equal to zero, then X and Y are said 
to be uncorrelated. This is implied by independence but is in general a weaker 
property. As a consequence of (6.3.9), we have always $-1 \le \rho(X, Y) \le 1$. 
The sign as well as the absolute value of $\rho$ gives a sort of gauge of the mutual 
dependence between the random variables.† See also Exercise 30 of Chapter 7. 
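As an illustration of covariance and correlation, take two independent fair dice X and Y and consider the dependent pair U = X, V = X + Y. This example and the names in the Python sketch below are ours, not part of the text.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

faces = range(1, 7)
# Joint distribution of (U, V) = (X, X + Y) for two independent fair dice X, Y.
joint = {}
for x, y in product(faces, faces):
    key = (x, x + y)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

def E(f):
    """Expectation of f(U, V) under the joint distribution above."""
    return sum(f(u, v) * p for (u, v), p in joint.items())

cov = E(lambda u, v: u * v) - E(lambda u, v: u) * E(lambda u, v: v)
var_u = E(lambda u, v: u * u) - E(lambda u, v: u) ** 2
var_v = E(lambda u, v: v * v) - E(lambda u, v: v) ** 2
rho = float(cov) / sqrt(float(var_u * var_v))
```

Here the covariance equals the variance of X, and the coefficient of correlation comes out as the cosine of 45 degrees, comfortably inside the bounds given by (6.3.9).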


Example 6. The most classical application of the preceding results is to the 
case of Bernoullian random variables (see Example 9 of §4.4). These are 
independent with the same probability distribution as follows: 


(6.3.10)    $X = \begin{cases} 1 & \text{with probability } p; \\ 0 & \text{with probability } q = 1 - p. \end{cases}$


We have encountered them in coin-tossing (Example 8 of §2.4), but the 
scheme can be used in any repeated trials in which there are only two 
outcomes: success (X = 1) or failure (X = 0). For instance, Example 1 in §6.1 is 
the case where p = 1/100 and the monetary unit is “ten grand.” The chances 
of either “cure” or “death” in a major surgical operation is another illus- 
tration. 


† The mathematician Emil Artin told me the following story in 1947. “Everybody knows 
that probability and statistics are the same thing, and statistics is nothing but correlation. 
Now the correlation is just the cosine of an angle. Thus, all is trivial.” 

The mean and variance of X are easy to compute: 

    $E(X) = p, \quad \sigma^2(X) = p - p^2 = pq.$

Let $\{X_n, n \ge 1\}$ denote Bernoullian random variables and write 

(6.3.11)    $S_n = X_1 + \cdots + X_n$

for the nth partial sum. It represents the total number of successes in n trials. 
By Theorem 1, 

(6.3.12)    $E(S_n) = E(X_1) + \cdots + E(X_n) = np.$

This would be true even without independence. Next by Theorem 4, 

(6.3.13)    $\sigma^2(S_n) = \sigma^2(X_1) + \cdots + \sigma^2(X_n) = npq.$


The ease with which these results are obtained shows a great technical 
advance. Recall that (6.3.12) has been established in (4.4.16), via the binomial 
distribution of S, and a tricky computation. A similar method is available 
for (6.3.13) and the reader is strongly advised to carry it out for practice and 
comparison. But how much simpler is our new approach, going from the 
mean and variance of the individual summands to those of the sum without 
the intervention of probability distributions. In more complicated cases the 
latter will be very hard if not impossible to get. That explains why we are 
devoting several sections to the discussion of mean and variance which often 
suffice for theoretical as well as practical purposes. 
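For comparison, the longer route through the binomial distribution can be left to a machine. The Python sketch below (the name `binomial_moments` is ours) computes the mean and variance of $S_n$ directly from the binomial probabilities, to be set against np and npq from (6.3.12) and (6.3.13).

```python
from fractions import Fraction
from math import comb

def binomial_moments(n, p):
    """Mean and variance of S_n computed from the binomial distribution itself."""
    q = 1 - p
    pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    second = sum(k * k * w for k, w in enumerate(pmf))
    return mean, second - mean**2
```

With n = 10 and p = 1/3, for example, the function returns 10/3 and 20/9, that is, np and npq.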


Example 7. Returning to the matching problem in §6.2, let us now compute 
the standard deviation of the number of matches. The $I_{A_j}$'s in (6.2.8) are not 
independent, but formula (6.3.8) is applicable and yields 

    $E(N^2) = \sum_{j=1}^{n} E(I_{A_j}^2) + 2\sum_{1 \le j < k \le n} E(I_{A_j} I_{A_k}).$

Clearly, 

    $E(I_{A_j}^2) = P(A_j) = \frac{1}{n}, \quad E(I_{A_j} I_{A_k}) = P(A_j A_k) = \frac{1}{n(n-1)}$

by (6.2.6) and (6.2.7). Substituting into the above, we obtain 

    $E(N^2) = n \cdot \frac{1}{n} + 2\binom{n}{2}\frac{1}{n(n-1)} = 1 + 1 = 2.$

Hence 

    $\sigma^2(N) = E(N^2) - E(N)^2 = 2 - 1 = 1.$

Rarely does an interesting general problem produce such simple numerical 
answers. 
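The value of the variance in Example 7 can again be confirmed by enumeration for small n. The Python sketch below is ours; note that (6.2.7) presupposes at least two cards, so the check starts at n = 2.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def match_variance(n):
    """sigma^2(N) for the number of matches N, by listing all n! permutations."""
    counts = [sum(j == card for j, card in enumerate(perm))
              for perm in permutations(range(n))]
    first = Fraction(sum(counts), factorial(n))                  # E(N)
    second = Fraction(sum(c * c for c in counts), factorial(n))  # E(N^2)
    return second - first**2
```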


6.4. Multinomial distribution 


A good illustration of the various notions and techniques developed in the 
preceding sections is the multinomial distribution. This is a natural generali- 
zation of the binomial distribution and serves as a model for repeated trials 
in which there are a number of possible outcomes instead of just “success or 
failure.’ We begin with the algebraic formula called the multinomial theorem: 


(6.4.1)    $(x_1 + \cdots + x_r)^n = \sum \frac{n!}{n_1! \cdots n_r!}\, x_1^{n_1} \cdots x_r^{n_r},$

where the sum ranges over all ordered r-tuples of integers $n_1, \ldots, n_r$ satisfying 
the following conditions: 

(6.4.2)    $n_1 \ge 0, \ldots, n_r \ge 0, \quad n_1 + \cdots + n_r = n.$

When r = 2 this reduces to the binomial theorem. For then there are n + 1 
ordered couples 

    $(0, n), (1, n-1), \ldots, (k, n-k), \ldots, (n, 0)$

with the corresponding coefficients 

    $\frac{n!}{0!\,n!}, \frac{n!}{1!\,(n-1)!}, \ldots, \frac{n!}{k!\,(n-k)!}, \ldots, \frac{n!}{n!\,0!},$

or 

    $\binom{n}{0}, \binom{n}{1}, \ldots, \binom{n}{k}, \ldots, \binom{n}{n}.$

Hence the sum can be written explicitly as 

    $\binom{n}{0} x^0 y^n + \binom{n}{1} x^1 y^{n-1} + \cdots + \binom{n}{k} x^k y^{n-k} + \cdots + \binom{n}{n} x^n y^0 = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$

In the general case the n identical factors $(x_1 + \cdots + x_r)$ on the left side of 
(6.4.1) are multiplied out and the terms collected on the right. Each term is 
of the form $x_1^{n_1} \cdots x_r^{n_r}$ with the exponents $n_j$ satisfying (6.4.2). Such a term 
appears $n!/(n_1! \cdots n_r!)$ times because this is the number of ways of permuting 
n objects (the n factors) which belong to r different varieties (the x's), such 
that $n_j$ of them belong to the jth variety. [You see some combinatorics are 
in the nature of things and cannot be avoided even if you have skipped most 
of Chapter 3.] 

A concrete model for the multinomial distribution may be described as 
follows. An urn contains balls of r different colors in the proportions: 


    $p_1 : \cdots : p_r$, where $p_1 + \cdots + p_r = 1$. 


We draw n balls one after another with replacement. Assume independence 
of the successive drawings, which is simulated by a thorough shake-up of 
the urn after each drawing. What is the probability that so many of the balls 
drawn are of each color? 

Let $X_1, \ldots, X_n$ be independent random variables all having the same 
distribution as the X below: 

(6.4.3)    $X = \begin{cases} 1 & \text{with probability } p_1, \\ 2 & \text{with probability } p_2, \\ \ \vdots \\ r & \text{with probability } p_r. \end{cases}$

What is the joint distribution of $(X_1, \ldots, X_n)$, namely 

(6.4.4)    $P(X_1 = x_1, \ldots, X_n = x_n)$

for all possible choices of $x_1, \ldots, x_n$ from 1 to r? Here the numerical values 
correspond to labels for the colors. Such a quantification is not necessary 
but sometimes convenient. It also leads to questions which are not intended 
for the color scheme, such as the probability that “$X_1 + \cdots + X_n = m$.” But 
suppose we change the balls to lottery tickets bearing different monetary 
prizes, or to people having various ages or incomes, then the numerical 
formulation in (6.4.3) is pertinent. What about negative or fractional values 
for the X’s? This can be accommodated by a linear transformation (cf. 
Example 14 in §4.5) provided all possible values are commensurable, say 
ordinary terminating decimals. For example if the values are in 3 decimal 
places and range from — 10 up, then we can use 


X’ = 10000 + 1000X 


instead of X. The value —9.995 becomes 10000 — 9995 = 5 in the new 
scale. In a super-pragmatic sense, we might even argue that the multinomial 
distribution is all we need for sampling in independent trials. For in reality 
we shall never be able to distinguish between (say) $10^{10^{10}}$ different varieties of 
anything. But this kind of finitist attitude would destroy a lot of mathematics. 

Let us evaluate (6.4.4). It is equal to $P(X_1 = x_1) \cdots P(X_n = x_n)$ by 
independence, and each factor is one of the p's in (6.4.3). To get an explicit 
expression we must know how many of the $x_j$'s are 1 or 2 or ... or r. Suppose 
$n_1$ of them are equal to 1, $n_2$ of them equal to 2, ..., $n_r$ of them equal to r. 
Then these $n_j$'s satisfy (6.4.2) and the probability in question is equal to 
$p_1^{n_1} \cdots p_r^{n_r}$. It is convenient to introduce new random variables $N_j$, $1 \le j \le r$, 
as follows: 

    $N_j$ = number of X's among $(X_1, \ldots, X_n)$ that take the value j. 

Each $N_j$ takes a value from 0 to n, but the random variables $N_1, \ldots, N_r$ 
cannot be independent since they are subject to the obvious restriction: 

(6.4.5)    $N_1 + \cdots + N_r = n.$

However, their joint distribution can be written down: 

(6.4.6)    $P(N_1 = n_1, \ldots, N_r = n_r) = \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r}.$

The argument here is exactly the same as that given at the beginning of this 
section for (6.4.1), but we will repeat it. For any particular, or completely 
specified, sequence $(X_1, \ldots, X_n)$ satisfying the conditions $N_1 = n_1, \ldots, N_r = n_r$, 
we have just shown that the probability is given by $p_1^{n_1} \cdots p_r^{n_r}$. But 
there are $n!/(n_1! \cdots n_r!)$ different particular sequences satisfying the same 
conditions, obtained by permuting the n factors of which $n_1$ factors are $p_1$, $n_2$ 
factors are $p_2$, etc. To nail this down in a simple numerical example, let 
$n = 4$, $r = 3$, $n_1 = 2$, $n_2 = n_3 = 1$. This means in 4 drawings there are 2 balls 
of color 1, and 1 ball each of color 2 and 3. All the possible particular 
sequences are listed below: 

    (1, 1, 2, 3)    (1, 1, 3, 2) 
    (1, 2, 1, 3)    (1, 3, 1, 2) 
    (1, 2, 3, 1)    (1, 3, 2, 1) 
    (2, 1, 1, 3)    (3, 1, 1, 2) 
    (2, 1, 3, 1)    (3, 1, 2, 1) 
    (2, 3, 1, 1)    (3, 2, 1, 1) 

Their number is $12 = \frac{4!}{2!\,1!\,1!}$ and the associated probability is $12 p_1^2 p_2 p_3$. 
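The count of particular sequences can be reproduced mechanically. The Python sketch below, with function names of our own choosing, lists all sequences having prescribed counts and compares their number with the multinomial coefficient.

```python
from itertools import product
from math import factorial

def sequences_with_counts(counts):
    """All sequences in {1..r}^n in which the value j occurs counts[j-1] times."""
    n, r = sum(counts), len(counts)
    return [seq for seq in product(range(1, r + 1), repeat=n)
            if all(seq.count(j + 1) == c for j, c in enumerate(counts))]

def multinomial_coefficient(counts):
    """n!/(n_1! ... n_r!) for n = n_1 + ... + n_r."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)
    return coeff
```

Calling `sequences_with_counts((2, 1, 1))` returns the twelve sequences of the numerical example, in lexicographic order.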


Formula (6.4.6), in which the indices $n_j$ range over all possible integer 
values subject to (6.4.2), is called the multinomial distribution for the random 
variables $(N_1, \ldots, N_r)$. Specifically, it may be denoted by 
$M(n; r; p_1, \ldots, p_{r-1}, p_r)$ subject to $p_1 + \cdots + p_r = 1$. For the binomial case r = 2, this is 
often written as B(n; p), see Example 9 of §4.4. 

If we divide (6.4.1) through by its left member, and put 

    $p_j = \frac{x_j}{x_1 + \cdots + x_r}, \quad 1 \le j \le r,$

we obtain 

(6.4.7)    $1 = \sum \frac{n!}{n_1! \cdots n_r!}\, p_1^{n_1} \cdots p_r^{n_r}.$

This yields a check that the sum of all the terms of the multinomial 
distribution is indeed equal to 1. Conversely, since (6.4.7) is a consequence of its 
probabilistic interpretation, we can deduce (6.4.1) from it, at least when 
$x_j \ge 0$, by writing $(p_1 + \cdots + p_r)^n$ for the left member in (6.4.7). This is 
another illustration of the way probability theory can add a new meaning to 
an algebraic formula; cf. the last part of §3.3. 

Marginal distributions (see §4.6) of $(N_1, \ldots, N_r)$ can be derived by a 
simple argument without computation. If we are interested in $N_1$ alone, then 
we can lump the r − 1 other varieties as “not 1” with probability $1 - p_1$. 
Thus the multinomial distribution collapses into a binomial one $B(n; p_1)$, 
namely: 

    $P(N_1 = n_1) = \frac{n!}{n_1!\,(n - n_1)!}\, p_1^{n_1} (1 - p_1)^{n - n_1}.$

From this we can deduce the mean and variance of $N_1$ as in Example 6 of 
§6.3. In general, 


(6.4.8)    $E(N_j) = np_j, \quad \sigma^2(N_j) = np_j(1 - p_j), \quad 1 \le j \le r.$


Next, if we are interested in the pair $(N_1, N_2)$, a similar lumping yields 
$M(n; 3; p_1, p_2, 1 - p_1 - p_2)$, namely: 

(6.4.9)    $P(N_1 = n_1, N_2 = n_2) = \frac{n!}{n_1!\,n_2!\,(n - n_1 - n_2)!}\, p_1^{n_1} p_2^{n_2} (1 - p_1 - p_2)^{n - n_1 - n_2}.$

We can now express $E(N_1 N_2)$ by using (4.5.4) or (6.3.2) [without 
independence]: 

    $E(N_1 N_2) = \sum n_1 n_2 P(N_1 = n_1, N_2 = n_2) = \sum \frac{n!}{n_1!\,n_2!\,n_3!}\, n_1 n_2\, p_1^{n_1} p_2^{n_2} p_3^{n_3},$

where $n_3 = n - n_1 - n_2$, $p_3 = 1 - p_1 - p_2$ and the sum ranges as in (6.4.2) 
with r = 3. The multiple sum above can be evaluated by generalizing the 
device used in Example 9 of §4.4 for the binomial case. Take the indicated 
second partial derivative below: 

    $\frac{\partial^2}{\partial x_1\,\partial x_2}(x_1 + x_2 + x_3)^n = n(n-1)(x_1 + x_2 + x_3)^{n-2} = \sum \frac{n!}{n_1!\,n_2!\,n_3!}\, n_1 n_2\, x_1^{n_1 - 1} x_2^{n_2 - 1} x_3^{n_3}.$

Multiply through by $x_1 x_2$ and then put $x_1 = p_1$, $x_2 = p_2$, $x_3 = p_3$. The result 
is $n(n-1) p_1 p_2$ on one side and the desired multiple sum on the other. Hence 
we have in general for $j \ne k$: 

    $E(N_j N_k) = n(n-1) p_j p_k;$

(6.4.10)    $\mathrm{Cov}(N_j, N_k) = E(N_j N_k) - E(N_j)E(N_k) = n(n-1) p_j p_k - (np_j)(np_k) = -n p_j p_k.$
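Formula (6.4.10) may be checked by summing over the distribution (6.4.9) directly. The Python sketch below does this with exact fractions; the function names and the sample values n = 5, $p_1$ = 1/2, $p_2$ = 1/3 are ours.

```python
from fractions import Fraction
from math import factorial

def trinomial_pmf(n, p1, p2):
    """{(n1, n2): probability} from (6.4.9), with p3 = 1 - p1 - p2."""
    p3 = 1 - p1 - p2
    pmf = {}
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            n3 = n - n1 - n2
            coeff = factorial(n) // (factorial(n1) * factorial(n2) * factorial(n3))
            pmf[(n1, n2)] = coeff * p1**n1 * p2**n2 * p3**n3
    return pmf

def covariance(pmf):
    """Cov(N_1, N_2) = E(N_1 N_2) - E(N_1)E(N_2), summed over the pmf."""
    e1 = sum(a * w for (a, b), w in pmf.items())
    e2 = sum(b * w for (a, b), w in pmf.items())
    e12 = sum(a * b * w for (a, b), w in pmf.items())
    return e12 - e1 * e2

pmf = trinomial_pmf(5, Fraction(1, 2), Fraction(1, 3))
```

The covariance is negative, as it should be: when more of the n drawings fall into one variety, fewer are left for the others.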


It is fun to check out the formula (6.3.8), recalling (6.4.5): 

    $n^2 = E\{(N_1 + \cdots + N_r)^2\}$
    $= \sum_{j=1}^{r} \{n(n-1) p_j^2 + n p_j\} + 2\sum_{1 \le j < k \le r} n(n-1) p_j p_k$
    $= n(n-1)\left(\sum_{j=1}^{r} p_j\right)^2 + n\sum_{j=1}^{r} p_j = n(n-1) + n = n^2.$


There is another method of calculating $E(N_j N_k)$, similar to the first 
method in Example 6, §6.3. Let j be fixed and 

    $\xi(x) = \begin{cases} 1 & \text{if } x = j, \\ 0 & \text{if } x \ne j. \end{cases}$

As a function of the real variable x, $\xi$ is just the indicator of the singleton 
{j}. Now introduce the random variable 

(6.4.11)    $\xi_\nu = \xi(X_\nu) = \begin{cases} 1 & \text{if } X_\nu = j, \\ 0 & \text{if } X_\nu \ne j; \end{cases}$

namely, $\xi_\nu$ is the indicator of the event $\{X_\nu = j\}$. In other words the $\xi_\nu$'s 
count just those $X_\nu$'s taking the value j, so that $N_j = \xi_1 + \cdots + \xi_n$. Now we 
have 

    $E(\xi_\nu) = P(X_\nu = j) = p_j;$
    $\sigma^2(\xi_\nu) = E(\xi_\nu^2) - E(\xi_\nu)^2 = p_j - p_j^2 = p_j(1 - p_j).$

Finally, the random variables $\xi_1, \ldots, \xi_n$ are independent since $X_1, \ldots, X_n$ 
are, by Proposition 6 of §5.5. Hence by Theorems 1 and 4: 

(6.4.12)    $E(N_j) = E(\xi_1) + \cdots + E(\xi_n) = np_j,$
            $\sigma^2(N_j) = \sigma^2(\xi_1) + \cdots + \sigma^2(\xi_n) = np_j(1 - p_j).$


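Next, let $k \ne j$ and define $\eta$ and $\eta_\nu$ in the same way as $\xi$ and $\xi_\nu$ are defined, 
but with k in place of j. Consider now for $1 \le \nu \le n$, $1 \le \nu' \le n$: 

(6.4.13)    $E(\xi_\nu \eta_{\nu'}) = P(X_\nu = j, X_{\nu'} = k) = \begin{cases} p_j p_k & \text{if } \nu \ne \nu', \\ 0 & \text{if } \nu = \nu'. \end{cases}$

Finally we calculate 

    $E(N_j N_k) = E\left\{\left(\sum_{\nu=1}^{n} \xi_\nu\right)\left(\sum_{\nu'=1}^{n} \eta_{\nu'}\right)\right\}$
    $= \sum_{\nu=1}^{n} E(\xi_\nu \eta_\nu) + \sum_{\nu \ne \nu'} E(\xi_\nu \eta_{\nu'})$
    $= n(n-1) p_j p_k,$

by (6.4.13) because there are $(n)_2 = n(n-1)$ terms in the last written sum. 
This is of course the same result as in (6.4.10). 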

We conclude with a simple numerical illustration of a general problem 
mentioned above. 


Example 8. Three identical dice are thrown. What is the probability of 
obtaining a total of 9? The dice are not supposed to be symmetrical and the 
probability of turning up face j is equal to $p_j$, $1 \le j \le 6$; same for all three dice. 

Let us list the possible cases in terms of the X's and the N's respectively: 

    X_1 X_2 X_3 | N_1 N_2 N_3 N_4 N_5 N_6 | permutation number | probability
     1   2   6  |  1   1               1  |         6          | $6 p_1 p_2 p_6$
     1   3   5  |  1       1       1      |         6          | $6 p_1 p_3 p_5$
     1   4   4  |  1           2          |         3          | $3 p_1 p_4^2$
     2   2   5  |      2           1      |         3          | $3 p_2^2 p_5$
     2   3   4  |      1   1   1          |         6          | $6 p_2 p_3 p_4$
     3   3   3  |          3              |         1          | $p_3^3$

Hence 

    $P(X_1 + X_2 + X_3 = 9) = 6(p_1 p_2 p_6 + p_1 p_3 p_5 + p_2 p_3 p_4) + 3(p_1 p_4^2 + p_2^2 p_5) + p_3^3.$

If all the p's are equal to 1/6, then this is equal to 

    $\frac{6 + 6 + 3 + 3 + 6 + 1}{6^3} = \frac{25}{216}.$


The numerator 25 is equal to the sum 

    $\sum \frac{3!}{n_1! \cdots n_6!},$

where the $n_j$'s satisfy the conditions 

    $n_1 + n_2 + n_3 + n_4 + n_5 + n_6 = 3,$
    $n_1 + 2n_2 + 3n_3 + 4n_4 + 5n_5 + 6n_6 = 9.$

There are 6 solutions tabulated above as possible values of the $N_j$'s. 
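The tabulation in Example 8 can be replaced by a direct enumeration over the 216 ordered outcomes. The following Python sketch is ours (the name `prob_total` and the loaded die used in the check are hypothetical); it works for any face probabilities.

```python
from fractions import Fraction
from itertools import product

def prob_total(total, p, dice=3):
    """P(X_1 + ... + X_dice = total) when face j has probability p[j-1]."""
    answer = 0
    for faces in product(range(1, len(p) + 1), repeat=dice):
        if sum(faces) == total:
            term = 1
            for face in faces:
                term *= p[face - 1]     # independence of the throws
            answer += term
    return answer

fair = [Fraction(1, 6)] * 6
```

For symmetrical dice `prob_total(9, fair)` gives 25/216, and for unsymmetrical ones the result agrees with the formula in terms of the p's displayed in the example.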

In the general context of the X's discussed above, the probability 
$P(X_1 + \cdots + X_n = m)$ is obtained by summing the right side of (6.4.6) 
over all $(n_1, \ldots, n_r)$ satisfying both (6.4.2) and 

    $1 n_1 + 2 n_2 + \cdots + r n_r = m.$

It is obvious that we need a computing machine to handle such explicit 
formulas. Fortunately in most problems we are interested in cruder results 
such as 

    $P(a_n \le X_1 + \cdots + X_n \le b_n)$


for large values of n. The relevant asymptotic results and limit theorems will 
be the subject matter of Chapter 7. One kind of machinery needed for this 
purpose will be developed in the next section. 


6.5. Generating function and the like 


A powerful mathematical device, a true gimmick, is the generating function 
invented by the great prolific mathematician Euler [1707-1783] to study the 
partition problem in number theory. Let X be a random variable taking only 
nonnegative integer values with the probability distribution given by 


\[
P(X = j) = a_j, \quad j = 0, 1, 2, \ldots. \tag{6.5.1}
\]

The idea is to put all the information contained above in a compact capsule.
For this purpose a dummy variable \(z\) is introduced and the following power
series in \(z\) set up:

\[
g(z) = a_0 + a_1 z + a_2 z^2 + \cdots = \sum_{j=0}^{\infty} a_j z^j. \tag{6.5.2}
\]


This is called the generating function associated with the sequence of numbers
\(\{a_j, j \ge 0\}\). In the present case we may also call it the generating
function of the random variable X with the probability distribution (6.5.1).
Thus \(g\) is a function of \(z\) which will be regarded as a real variable,
although in some more advanced applications it is advantageous to consider \(z\)
as a complex variable. Remembering that \(\sum_j a_j = 1\), it is easy to see
that the power series in (6.5.2) converges for \(|z| \le 1\). In fact it is
dominated as follows:

\[
|g(z)| \le \sum_j |a_j|\,|z|^j \le \sum_j a_j = 1, \quad \text{for } |z| \le 1.
\]

[It is hoped that your knowledge about power series goes beyond the “ratio
test.” The above estimate is more direct and says a lot more.] Now a theorem
in calculus asserts that we can differentiate the series term by term to get the
derivatives of \(g\), so long as we restrict its domain of validity to \(|z| < 1\):


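\[
\begin{aligned}
g'(z) &= a_1 + 2a_2 z + 3a_3 z^2 + \cdots = \sum_{n=1}^{\infty} n a_n z^{n-1}, \\
g''(z) &= 2a_2 + 6a_3 z + \cdots = \sum_{n=2}^{\infty} n(n - 1) a_n z^{n-2}.
\end{aligned}
\tag{6.5.3}
\]

In general we have

\[
g^{(j)}(z) = \sum_{n=j}^{\infty} n(n - 1) \cdots (n - j + 1)\, a_n z^{n-j}
 = \sum_{n=j}^{\infty} \binom{n}{j} j!\, a_n z^{n-j}. \tag{6.5.4}
\]

If we set \(z = 0\) above, all the terms vanish except the constant term:

\[
g^{(j)}(0) = j!\, a_j \quad \text{or} \quad a_j = \frac{1}{j!}\, g^{(j)}(0). \tag{6.5.5}
\]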


This shows that we can recover all the a,’s from g. Therefore, not only does 
the probability distribution determine the generating function, but also vice 
versa. So there is no loss of information in the capsule. In particular, putting 
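\(z = 1\) in \(g'\) and \(g''\) we get by (4.3.18)

\[
g'(1) = \sum_{n=0}^{\infty} n a_n = E(X), \qquad
g''(1) = \sum_{n=0}^{\infty} n^2 a_n - \sum_{n=0}^{\infty} n a_n = E(X^2) - E(X);
\]

provided that the series converge, in which case (6.5.3) holds for \(z = 1\)†. Thus

\[
E(X) = g'(1), \qquad E(X^2) = g'(1) + g''(1). \tag{6.5.6}
\]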


In practice, the following qualitative statement, which is a corollary of the
above, is often sufficient.


Theorem 5. The probability distribution of a nonnegative integer-valued
random variable is uniquely determined by its generating function.

Let Y be a random variable having the probability distribution \(\{b_k, k \ge 0\}\)
where \(b_k = P(Y = k)\), and let \(h\) be its generating function:

\[
h(z) = \sum_{k=0}^{\infty} b_k z^k.
\]


Suppose that \(g(z) = h(z)\) for all \(|z| < 1\), then the theorem asserts that
\(a_k = b_k\) for all \(k \ge 0\). Consequently X and Y have the same distribution,
and this is what we mean by “unique determination.” The explicit formula
(6.5.4) of course implies this, but there is a simpler argument as follows. Since

\[
\sum_{k=0}^{\infty} a_k z^k = \sum_{k=0}^{\infty} b_k z^k, \quad |z| < 1;
\]

we get at once \(a_0 = b_0\) by setting \(z = 0\) in the equation. After removing
these terms we can cancel a factor \(z\) on both sides, and then get \(a_1 = b_1\) by
again setting \(z = 0\). Repetition of this process establishes the theorem. You
ought to realize that we have just reproduced a terrible proof of a standard
result which used to be given in some calculus textbooks! Can you tell what is
wrong there and how to make it correct?

† This is an Abelian theorem; cf. the discussion after (8.4.17).

We proceed to discuss a salient property of generating functions when 
they are multiplied together. Using the notation above we have 


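\[
g(z)h(z) = \Big(\sum_{j=0}^{\infty} a_j z^j\Big)\Big(\sum_{k=0}^{\infty} b_k z^k\Big)
 = \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} a_j b_k z^{j+k}.
\]

We will rearrange the terms of this double series into a power series in the
usual form. Then

\[
g(z)h(z) = \sum_{l=0}^{\infty} c_l z^l \tag{6.5.7}
\]

where

\[
c_l = \sum_{j+k=l} a_j b_k = \sum_{j=0}^{l} a_j b_{l-j}. \tag{6.5.8}
\]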


The sequence \(\{c_l\}\) is called the convolution of the two sequences \(\{a_j\}\)
and \(\{b_k\}\). What does \(c_l\) stand for? Suppose that the random variables X
and Y are independent. Then we have by (6.5.8),

\[
c_l = \sum_{j=0}^{l} P(X = j)P(Y = l - j)
 = \sum_{j=0}^{l} P(X = j,\ Y = l - j) = P(X + Y = l).
\]


The last equation above is obtained by the rules in §5.1, as follows. Given
that \(X = j\), we have \(X + Y = l\) if and only if \(Y = l - j\), hence by
Proposition 2 of §5.2 [cf. (5.2.4)]:

\[
\begin{aligned}
P(X + Y = l) &= \sum_{j=0}^{\infty} P(X = j)P(X + Y = l \mid X = j) \\
 &= \sum_{j=0}^{\infty} P(X = j)P(Y = l - j \mid X = j) \\
 &= \sum_{j=0}^{l} P(X = j)P(Y = l - j),
\end{aligned}
\]

because Y is independent of X and it does not take negative values. In other
words, we have shown that for all \(l \ge 0\),

\[
P(X + Y = l) = c_l,
\]

so that \(\{c_l, l \ge 0\}\) is the probability distribution of the random variable
X + Y. Therefore, by definition its generating function is given by the power
series in (6.5.7) and so equal to the product of the generating functions of X
and of Y. After an easy induction, we can state the result as follows.


Theorem 6. If the random variables \(X_1, \ldots, X_n\) are independent and have
\(g_1, \ldots, g_n\) as their generating functions, then the generating function of
the sum \(X_1 + \cdots + X_n\) is given by the product \(g_1 \cdots g_n\).

This theorem is of great importance since it gives a method to study sums 
of independent random variables via generating functions, as we shall see in 
Chapter 7. In some cases the product of the generating functions takes a
simple form and then we can deduce its probability distribution by looking 
at its power series, or equivalently by using (6.5.5). The examples below will 
demonstrate this method. 
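
As a small numerical companion to (6.5.7) and (6.5.8), the following Python sketch forms the convolution of two finite distributions and checks that the generating function of the result is the product of the two generating functions. The two distributions and the names `convolve` and `evaluate` are made up for this illustration; exact fractions are used so that the comparison involves no rounding.

```python
from fractions import Fraction


def convolve(a, b):
    """c_l = sum over j + k = l of a_j * b_k, as in (6.5.8)."""
    c = [Fraction(0)] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk
    return c


def evaluate(coeffs, z):
    """Value at z of the generating function with these coefficients."""
    return sum(coef * z ** n for n, coef in enumerate(coeffs))


# Hypothetical distributions of two independent variables X and Y.
a = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]  # P(X = 0), P(X = 1), P(X = 2)
b = [Fraction(1, 3), Fraction(2, 3)]                  # P(Y = 0), P(Y = 1)

c = convolve(a, b)  # distribution of X + Y
z = Fraction(1, 2)
print(c)
print(evaluate(c, z) == evaluate(a, z) * evaluate(b, z))
```

Multiplying two polynomials and convolving their coefficient lists are the same operation, which is the whole content of the rearrangement leading to (6.5.7).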

For future reference let us take note that given a sequence of real numbers 
\(\{a_j\}\), we can define the power series g as in (6.5.2). This will be called the
generating function associated with the sequence. If the series has a nonzero 
radius of convergence, then the preceding analysis can be carried over to 
this case without the probability interpretations. In particular, the convolu- 
tion of two such sequences can be defined as in (6.5.8), and (6.5.7) is still valid. 
In §8.3 below we shall use such generating functions whose coefficients are 
probabilities; then (6.5.3) is valid for |z| < 1, but the series may diverge for 
z= 1. 


Example 9. For the Bernoullian random variables \(X_1, \ldots, X_n\) (Example 6
of §6.3), the common generating function is

\[
g(z) = q + pz
\]

since \(a_0 = q\), \(a_1 = p\) in (6.5.1). Hence the generating function of
\(S_n = X_1 + \cdots + X_n\), where the X's are independent, is given by the nth
power of g:

\[
g(z)^n = (q + pz)^n.
\]

Its power series is therefore known from the binomial theorem, namely,

\[
g(z)^n = \sum_{k=0}^{n} \binom{n}{k} q^{n-k} p^k z^k.
\]

On the other hand, by definition of generating functions, we have

\[
g(z)^n = \sum_{k=0}^{\infty} P(S_n = k) z^k.
\]

Comparison of the last two expressions shows that

\[
P(S_n = k) = \binom{n}{k} p^k q^{n-k}, \quad 0 \le k \le n; \qquad P(S_n = k) = 0, \quad k > n.
\]

This is the Bernoulli formula we learned some time ago, but the derivation
is new and it is machine-processed.


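Example 10. For the waiting time distribution (§4.4), we have
\(p_j = q^{j-1}p\), \(j \ge 1\); hence

\[
g(z) = \sum_{j=1}^{\infty} q^{j-1} p z^j = \frac{p}{q} \sum_{j=1}^{\infty} (qz)^j
 = \frac{p}{q} \cdot \frac{qz}{1 - qz} = \frac{pz}{1 - qz}. \tag{6.5.9}
\]

Let \(S_n = T_1 + \cdots + T_n\) where the T's are independent and each has the g
in (6.5.9) as generating function. Then \(S_n\) is the waiting time for the nth
success. Its generating function is given by \(g^n\), and this can be expanded into
a power series by using the binomial series and (5.4.4):

\[
\begin{aligned}
g(z)^n &= \Big(\frac{pz}{1 - qz}\Big)^n = p^n z^n (1 - qz)^{-n}
 = p^n z^n \sum_{j=0}^{\infty} \binom{-n}{j} (-qz)^j \\
 &= \sum_{j=0}^{\infty} \binom{n + j - 1}{j} p^n q^j z^{n+j}
 = \sum_{j=0}^{\infty} \binom{n + j - 1}{n - 1} p^n q^j z^{n+j} \\
 &= \sum_{k=n}^{\infty} \binom{k - 1}{n - 1} p^n q^{k-n} z^k.
\end{aligned}
\]

Hence we obtain for \(j \ge 0\),

\[
P(S_n = n + j) = \binom{n + j - 1}{j} p^n q^j = \binom{-n}{j} p^n (-q)^j.
\]

The probability distribution given by
\(\big\{\binom{-n}{j} p^n (-q)^j,\ j \ge 0\big\}\) is called the
negative binomial distribution of order n. The discussion above shows that its
generating function is given by

\[
\Big(\frac{g(z)}{z}\Big)^n = \Big(\frac{p}{1 - qz}\Big)^n. \tag{6.5.10}
\]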


Now \(g(z)/z\) is the generating function of \(T_1 - 1\) (why?), which represents
the number of failures before the first success. Hence the generating
function in (6.5.10) is that of the random variable \(S_n - n\), which is the total
number of failures before the nth success.


Example 11. For the dice problem at the end of §6.4, we have \(p_j = 1/6\) for
\(1 \le j \le 6\) if the dice are symmetrical. Hence the associated generating
function is given by

\[
g(z) = \frac{1}{6}(z + z^2 + z^3 + z^4 + z^5 + z^6) = \frac{z(1 - z^6)}{6(1 - z)}.
\]

The generating function of the total points obtained by throwing 3 dice is
just \(g^3\). This can be expanded into a power series as follows:

\[
\begin{aligned}
g(z)^3 &= \frac{z^3(1 - z^6)^3}{6^3(1 - z)^3}
 = \frac{z^3}{6^3}(1 - 3z^6 + 3z^{12} - z^{18})(1 - z)^{-3} \\
 &= \frac{z^3}{6^3}(1 - 3z^6 + 3z^{12} - z^{18}) \sum_{k=0}^{\infty} \binom{k + 2}{2} z^k.
\end{aligned}
\]

The coefficient of \(z^9\) is easily found by inspection, since there are only two
ways of forming it from the product above:

\[
\frac{1}{6^3}\Big\{1 \cdot \binom{6 + 2}{2} - 3 \cdot \binom{0 + 2}{2}\Big\}
 = \frac{28 - 3}{6^3} = \frac{25}{6^3}.
\]


You may not be overly impressed by the speed of this new method, as 
compared with a combinatorial counting done in §6.4, but you should observe 
how the machinery works: 


Step 1°: Code the probabilities \(\{P(X = j), j \ge 0\}\) into a generating
function \(g\);

Step 2°: Process the function by raising it to nth power \(g^n\);

Step 3°: Decode the probabilities \(\{P(S_n = k), k \ge 0\}\) from \(g^n\) by
expanding it into a power series.
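
The three steps can be mimicked literally on a computing machine. The Python sketch below is one possible rendering, under the assumption of symmetrical dice: a generating function is stored as its list of coefficients, raising to a power is repeated polynomial multiplication, and decoding is reading off a coefficient. The helper names `poly_mul` and `poly_pow` are hypothetical.

```python
from fractions import Fraction


def poly_mul(a, b):
    """Coefficients of the product of two polynomials (a convolution)."""
    c = [Fraction(0)] * (len(a) + len(b) - 1)
    for j, aj in enumerate(a):
        for k, bk in enumerate(b):
            c[j + k] += aj * bk
    return c


def poly_pow(g, n):
    """Coefficients of g(z)^n by repeated multiplication."""
    result = [Fraction(1)]
    for _ in range(n):
        result = poly_mul(result, g)
    return result


# Step 1: code P(X = j) = 1/6 for j = 1, ..., 6 (index 0 holds P(X = 0) = 0).
die = [Fraction(0)] + [Fraction(1, 6)] * 6

# Step 2: process by raising g to the 3rd power.
g3 = poly_pow(die, 3)

# Step 3: decode P(S_3 = k) as the coefficient of z^k.
print(g3[9])
```

The coefficient read off for k = 9 agrees with the value 25/216 found twice above; the same list `g3` holds the whole distribution of the total of three dice.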


A characteristic feature of machine process is that parts of it can be performed 
mechanically such as the manipulations in (6.5.10). We need not keep track 
of what we are doing at every stage: plug something in, push a few buttons 
or crank some knobs, and out comes the product. To carry this gimmickry 
one step further, we will now exhibit the generating function of X in a form 
Euler would not have recognized [the concept of a random variable came 
late, not much before 1930]: 


\[
g(z) = E(z^X) \tag{6.5.12}
\]

namely the mathematical expectation of \(z^X\). Let us first recall that for each
\(z\), the function \(\omega \to z^{X(\omega)}\) is indeed a random variable. For
countable \(\Omega\) this is a special case of Proposition 2 in §4.2. When X takes
the value \(j\), \(z^X\) takes the value \(z^j\), hence by (4.3.15) the expectation
of \(z^X\) may be expressed as \(\sum_{j=0}^{\infty} P(X = j) z^j\) which is \(g(z)\).


An immediate pay-off is a new and smoother proof of Theorem 6 based
on different principles. The generating function of \(X_1 + \cdots + X_n\) is, by
what has just been said, equal to

\[
E(z^{X_1 + \cdots + X_n}) = E(z^{X_1} z^{X_2} \cdots z^{X_n})
\]

by the law of exponentiation. Now the random variables
\(z^{X_1}, z^{X_2}, \ldots, z^{X_n}\) are independent by Proposition 6 of §5.5, hence
by Theorem 2 of §6.3

\[
E(z^{X_1} z^{X_2} \cdots z^{X_n}) = E(z^{X_1}) E(z^{X_2}) \cdots E(z^{X_n}). \tag{6.5.13}
\]

Since \(E(z^{X_j}) = g_j(z)\) for each \(j\) this completes the proof of Theorem 6.

Another advantage of the expression \(E(z^X)\) is that it leads to extensions. If
X can take arbitrary real values this expression still has a meaning. For
simplicity let us consider only \(0 < z \le 1\). Every such \(z\) can be represented
as \(e^{-\lambda}\) with \(0 \le \lambda < \infty\), in fact the correspondence
\(z = e^{-\lambda}\) is one-to-one; see Figure 25.


Figure 25 


Now consider the new expression after such a change of variable:

\[
E(e^{-\lambda X}), \quad 0 \le \lambda < \infty. \tag{6.5.14}
\]

If X has the probability distribution in (6.5.1), then

\[
E(e^{-\lambda X}) = \sum_{j=0}^{\infty} a_j e^{-j\lambda}
\]

which is of course just our previous \(g(z)\) with \(z = e^{-\lambda}\). More
generally if X takes the values \(\{x_j\}\) with probabilities \(\{p_j\}\), then

\[
E(e^{-\lambda X}) = \sum_j p_j e^{-\lambda x_j} \tag{6.5.15}
\]

provided that the series converges absolutely. This is the case if all the values
\(x_j \ge 0\) because then \(e^{-\lambda x_j} \le 1\) and the series is dominated by
\(\sum_j p_j = 1\). Finally if X has the density function \(f\), then by (4.5.6)
with \(\varphi(u) = e^{-\lambda u}\):

\[
E(e^{-\lambda X}) = \int_{-\infty}^{\infty} e^{-\lambda u} f(u)\, du, \tag{6.5.16}
\]

provided that the integral converges. This is the case if \(f(u) = 0\) for \(u < 0\),
namely when X does not take negative values. We have therefore extended 
the notion of a generating function through (6.5.14) to a large class of random 
variables. This new gimmick is called the Laplace transform of X. In the 
analytic form given on the right side of (6.5.16) it is widely used in operational 
calculus, differential equations, and engineering applications. 
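
To see the change of variable \(z = e^{-\lambda}\) at work on a case already treated, the following Python sketch compares \(E(e^{-\lambda X})\), summed term by term for a symmetrical die, with the closed form of \(g\) from Example 11 evaluated at \(z = e^{-\lambda}\). It is only a numerical illustration; the function names and the sample values of \(\lambda\) are choices made for this example.

```python
from math import exp


def laplace_direct(lam, faces=6):
    """E(exp(-lam * X)) for a symmetrical die, summed term by term."""
    return sum(exp(-lam * j) / faces for j in range(1, faces + 1))


def generating_closed_form(z, faces=6):
    """g(z) = z (1 - z^6) / (6 (1 - z)) from Example 11, valid for z != 1."""
    return z * (1 - z ** faces) / (faces * (1 - z))


for lam in (0.1, 0.5, 1.0, 3.0):
    direct = laplace_direct(lam)
    via_g = generating_closed_form(exp(-lam))
    # The two columns should agree up to rounding, and neither exceeds 1.
    print(lam, direct, via_g)
```

The value \(\lambda = 0\) is left out of the loop only because the closed form is not defined at \(z = 1\); the series itself gives 1 there.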

If we replace the negative real parameter \(-\lambda\) in (6.5.14) by the purely
imaginary \(i\theta\), where \(i = \sqrt{-1}\) and \(\theta\) is real, we get the
Fourier transform \(E(e^{i\theta X})\); in probability theory it is also known as
the characteristic function of X. Let us recall De Moivre's formula (which used
to be taught in high school trigonometry courses), for real \(u\):

\[
e^{iu} = \cos u + i \sin u;
\]

and its consequence

\[
|e^{iu}|^2 = (\cos u)^2 + (\sin u)^2 = 1.
\]

This implies that for any real random variable X, we have
\(|e^{i\theta X}| = 1\); hence the function \(\varphi\):

\[
\varphi(\theta) = E(e^{i\theta X}), \quad -\infty < \theta < \infty, \tag{6.5.17}
\]

is always defined, in fact \(|\varphi(\theta)| \le 1\) for all \(\theta\). Herein lies the superiority of
this new transform over the others discussed above, which cannot be defined 
sometimes because the associated series or integral does not converge. On 
the other hand, we pay the price of having to deal with complex variables and 
functions which lie beyond the scope of an elementary text. Nevertheless, we 
will invoke both the Laplace and Fourier transforms in Chapter 7 and for 
future reference let us record the following theorem. 


Theorem 7. Theorems 5 and 6 remain true when the generating function is 
replaced by the Laplace transform (for non-negative random variables) or the 
Fourier transform (for arbitrary random variables). 

In the case of Theorem 6, this is immediate from (6.5.13) if the variable
\(z\) there is replaced by \(e^{-\lambda}\) or \(e^{i\theta}\). For Theorem 5 the
analogues lie deeper and require more advanced analysis (see [Chung 1;
Chapter 6]). The reader is asked to accept their truth by analogy from the
discussion above leading from \(E(z^X)\) to \(E(e^{-\lambda X})\) and
\(E(e^{i\theta X})\). After all, analogy is a time-honored method of learning.


Exercises 


1. The Massachusetts state lottery has 1000000 tickets. There is one first
prize of $50000; 9 second prizes of $2500 each; 90 third prizes of $250
each; 900 fourth prizes of $25 each. What is the expected value of one
ticket? five tickets?

2. Suppose in the lottery above only 80% of the tickets are sold. What is
the expected total to be paid out in prizes? If each ticket is sold at 50¢,
what is the expected profit for the state?

3. Five residential blocks are polled for racial mixture. The numbers of
houses having black or white owners are listed below:

    Block    1    2    3    4    5
    Black    3    2    4    3    4
    White   10   10    9   11   10

If two houses are picked at random from each block, what is the expected
number of black-owned ones among them?

4. Six dice are thrown once. Find the mean and variance of the total
points. Same question if the dice are thrown n times.

5. A lot of 1000 screws contains 1% with major defects and 5% with minor
defects. If 50 screws are picked at random and inspected, what are the
expected numbers of major and minor defectives?

6. In a bridge hand what is the expected number of spades? of different suits?
[Hint: for the second question let \(X_j = 1\) or 0 according as the jth suit is
represented in the hand or not; consider \(E(X_1 + X_2 + X_3 + X_4)\).]

7. An airport bus deposits 25 passengers at 7 stops. Assume that each
passenger is as likely to get off at any stop as another and that they act
independently. The bus stops only if someone wants to get off. What is
the expected number of stops it will make? [Hint: let \(X_j = 1\) or 0
according as someone gets off at the jth stop or not.]

8.* Given 500 persons picked at random, (a) What is the probability that
more than one of them have January 1 as birthday? (b) What is the
expected number among them who have this birthday? (c) What is
the expected number of days of the year that are birthdays of at least
one of these persons? (d) What is the expected number of days of the
year that are the birthdays of more than one of these persons? Ignore
leap years for simplicity. [Hint: for (b), (c), (d), proceed as in No. 7.]

9. Problems 6, 7 and 8 are different versions of occupancy problems which
may be formulated generally as follows. Put n unnumbered tokens into
m numbered boxes (see §3.3). What is the expected number of boxes
which get exactly [or at least] k tokens? One can also ask for instance:
what is the expected number of tokens which do not share their box with
any other token? Answer these questions and rephrase them in the
language of Problem 6, 7 or 8.

10. Using the occupancy model above, find the distribution of the tokens
in the boxes, namely, the probabilities that exactly \(n_j\) tokens go into
the jth box, \(1 \le j \le m\). Describe this distribution in the language of
Problem 6, 7 or 8.

11. An automatic machine produces a defective item with probability 2%.
When this happens an adjustment is made. Find the average number
of good items produced between adjustments.

12. Exactly one of six similar looking keys is known to open a certain door.
If you try them one after another how many do you expect to have tried
before the door is opened?

13. One hundred electric light bulbs are tested. If the probability of failure
is p for each bulb, what are the mean and standard deviation of the
number of failures? Assume stochastic independence of the bulbs.

14. Fifty persons queue up for chest X-ray examinations. Suppose there
are four “positive” cases among them. What is the expected number of
“negative” cases before the first positive case is spotted? [Hint: think
of the four as partitioning walls for the others. Thus the problem is
equivalent to finding the expected number of tokens in the first box
under (IV′) of §3.3.]

15. There are N coupons numbered 1 to N in a bag. Draw one after another
with replacement. (a) What is the expected number of drawings until
the first coupon drawn is drawn again? (b)* What is the expected
number of drawings until the first time a duplication occurs? [Hint:
for (b) compute first the probability of no duplication in n drawings.]

16. In the problem above, what is the expected maximum coupon-number
in n drawings? The same question if the coupons are drawn without
replacement. [Hint: find \(P(\text{maximum} \le k)\).]

17. In Pólya's urn scheme with \(c \ge -1\) (see §5.4),
(a) What is the expected number of red balls in n drawings?
(b) What is the expected number of red balls in the urn after the nth
drawing (and putting back c balls)?


18. If \(p_n \ge 0\), and \(r_n = \sum_{k=n}^{\infty} p_k\), show that

\[
\sum_{n=1}^{\infty} n p_n = \sum_{n=1}^{\infty} r_n
\]

whether both series converge or diverge to \(+\infty\). Hence if X is a random
variable taking nonnegative integer values, we have

\[
E(X) = \sum_{n=0}^{\infty} P(X > n). \tag{6.6.1}
\]

[Hint: Write \(p_n = r_n - r_{n+1}\), rearrange the series (called Abel's method
of summation in some calculus textbooks).]

19. Apply the formula (6.6.1) to compute the mean waiting time discussed
in Example 8 of §4.4. Note that \(P(T > n) = q^n\), \(n \ge 0\).

20. Let \(X_1, \ldots, X_m\) be independent nonnegative integer-valued random
variables all having the same distribution \(\{p_n, n \ge 0\}\); and
\(r_n = \sum_{k=n}^{\infty} p_k\). Show that

\[
E\{\min(X_1, \ldots, X_m)\} = \sum_{n=1}^{\infty} r_n^m.
\]

[Hint: use No. 18.]

21. Let X be a nonnegative random variable with density function \(f\). Show
that if \(r(u) = \int_u^{\infty} f(t)\, dt\), then

\[
E(X) = \int_0^{\infty} P(X > u)\, du = \int_0^{\infty} r(u)\, du. \tag{6.6.2}
\]

[Hint: this is the analogue of No. 18. Calculation with integrals is
smoother than with sums.]

22. Apply formula (6.6.2) to an X with the exponential density
\(\lambda e^{-\lambda u}\).

23. The duration T of a certain type of telephone call is found to satisfy
the relation

\[
P(T > t) = a e^{-\lambda t} + (1 - a) e^{-\mu t}, \quad t \ge 0;
\]

where \(0 \le a \le 1\), \(\lambda > 0\), \(\mu > 0\) are constants determined
statistically. Find the mean and variance of T. [Hint: for the mean a quick
method is to use No. 21.]

24. Suppose that the “life” of an electronic device has the exponential
density \(\lambda e^{-\lambda t}\) in hours. Knowing that it has been in use for 7 hours,
how much longer can it be expected to last? Compare this with its
initial life expectancy. Do you see any contradiction?

25. Let five devices described above be tested simultaneously. (a) How long
can you expect before one of them fails? (b) How long can you expect
before all of them fail?

26. The average error committed in measuring the diameter of a circular
disk is .2% and the area of the disk is computed from this measurement.
What is the average percentage error in the area if we ignore the square
of the percentage error in the diameter?

27. Express the mean and variance of \(aX + b\) in terms of those of X, where
\(a\) and \(b\) are two constants. Apply this to the conversion of temperature
from Centigrade to Fahrenheit:

\[
F = \frac{9}{5} C + 32.
\]

Mean, Variance and Transforms 


28. A gambler figures that he can always beat the house by doubling his
bet each time to recoup any losses. Namely he will quit as soon as he
wins, otherwise he will keep doubling his ante until he wins. The only
drawback to this winning system is that he may be forced to quit when
he runs out of funds. Suppose that he has a capital of $150 and begins
with a dollar bet, and suppose he has an even chance to win each time.
What is the probability that he will quit winning, and how much will he
have won? What is the probability that he will quit because he does not
have enough left to double his last ante, and how much will he have
lost in this case? What is his overall mathematical expectation by using
this system? The same questions if he will bet all his remaining capital
when he can no longer double.

29. Pick n points at random in [0, 1]. Find the expected value of the
maximum, minimum, and range (= maximum minus minimum).

30. Consider n independent events \(A_j\) with \(P(A_j) = p_j\), \(1 \le j \le n\).
Let N denote the (random) number of occurrences among them. Find the
generating function of N and compute E(N) from it.

31. Let \(\{p_j, j \ge 0\}\) be a probability distribution and

\[
u_k = \sum_{j=0}^{k} p_j, \qquad g(z) = \sum_{k=0}^{\infty} u_k z^k.
\]

Show that the power series converges for \(|z| < 1\). As an example let
\(S_n\) be as in Example 9 of §6.5, so that \(\{p_j\}\) is a binomial distribution.
What is the meaning of \(u_k\)? Find its generating function \(g\).

32. It is also possible to define the generating function of a random variable
which takes positive and negative values. To take a simple case, if

\[
P(X = k) = p_k, \quad k = 0, \pm 1, \pm 2, \ldots, \pm N,
\]

then

\[
g(z) = \sum_{k=-N}^{+N} p_k z^k
\]

is a rational function of \(z\), namely the quotient of two polynomials.
Find \(g\) when \(p_k = \frac{1}{2N + 1}\) above, which corresponds to the uniform
distribution over the set of integers
\(\{-N, -(N - 1), \ldots, -1, 0, +1, \ldots, N - 1, N\}\). Compute the mean from
\(g'\) for a check.

33. Let \(\{X_j, 1 \le j \le n\}\) be independent random variables such that

\[
X_j =
\begin{cases}
1 & \text{with probability } \tfrac{1}{4}, \\
0 & \text{with probability } \tfrac{1}{2}, \\
-1 & \text{with probability } \tfrac{1}{4};
\end{cases}
\]

and \(S_n = \sum_{j=1}^{n} X_j\). Find the generating function of \(S_n\) in the
sense of No. 32, and compute \(P(S_n = 0)\) from it. As a concrete application,
suppose A and B toss an unbiased coin n times each. What is the
probability that they score the same number of heads? [This problem
can also be solved without using generating functions, by using formula
(3.3.9).]

34.* In the coupon collecting problem of No. 15, let T denote the number
of drawings until a complete set of coupons is collected. Find the
generating function of T. Compute the mean from it for a beautiful
check with (6.1.8). [Hint: Let \(T_j\) be the waiting time until j different
coupons are collected; then it has a geometric distribution with
\(p_j = (N - j + 1)/N\). The \(T_j\)'s are independent.]

35. Let X and \(g\) be as in (6.5.1) and (6.5.2). Derive explicit formulas for the
first four moments of X in terms of \(g\) and its derivatives.

36. Denote the Laplace transform of X in (6.5.16) by \(L(\lambda)\). Express the mth
moment of X in terms of \(L\) and its derivatives.

37. Find the Laplace transform corresponding to the density function \(f\)
given below.

(a) \(f(u) = \frac{1}{b}\) in \((0, b)\), \(b > 0\).

(b) \(f(u) = \frac{2u}{c^2}\) in \((0, c)\), \(c > 0\).

(c) \(f(u) = \frac{\lambda^n u^{n-1}}{(n - 1)!} e^{-\lambda u}\) in \([0, \infty)\),
\(\lambda > 0\), \(n \ge 1\). [First verify that this is a density function! The
corresponding distribution is called the gamma distribution \(\Gamma(n; \lambda)\).]

38. Let \(S_n = T_1 + \cdots + T_n\) where the \(T_j\)'s are independent random
variables all having the density \(\lambda e^{-\lambda t}\). Find the Laplace transform
of \(S_n\). Compare with the result in No. 37(c). We can now use Theorem 7
to find \(P(a < S_n < b)\) for any real numbers \(a < b\).

39. Consider a population of N taxpayers paying various amounts of taxes,
and suppose the mean is \(m\) and variance is \(\sigma^2\). If n of these are selected
at random, show that the mean and variance of their total taxes are
equal to

\[
nm \quad \text{and} \quad \frac{N - n}{N - 1}\, n\sigma^2
\]

respectively. [Hint: denote the amounts by \(X_1, \ldots, X_n\) and use (6.4.8).
Some algebra may be saved by noting that \(E(X_j X_k)\) does not depend on
n, so it can be determined when n = N, but this trick is by no means
necessary.]

40. Prove Theorem 1 by the method used in the proof of Theorem 2. Do
the density case as well.

41. Let \(a(\cdot)\) and \(b(\cdot)\) be two probability density functions and define
their convolution \(c(\cdot)\) as follows:

\[
c(v) = \int_{-\infty}^{\infty} a(u) b(v - u)\, du, \quad -\infty < v < \infty;
\]

cf. (6.5.7). Show that \(c(\cdot)\) is also a probability density function, often
denoted by \(a * b\).

42.* If \(a(u) = \lambda e^{-\lambda u}\) for \(u \ge 0\), find the convolution of
\(a(\cdot)\) with itself. Find by induction the n-fold convolution
\(a * a * \cdots * a\) (n times). [Hint: the result is given in No. 37(c).]

43. Prove Theorem 4 for nonnegative integer-valued random variables by
using generating functions. [Hint: express the variance by generating
functions as in No. 35 and then use Theorem 6.]

44. Prove the analogues of Theorem 6 for Laplace and Fourier transforms.

45. Consider a sequence of independent trials each having probability p
for success and q for failure. Show that the probability that the nth
success is preceded by exactly j failures is equal to

\[
\binom{n + j - 1}{j} p^n q^j.
\]

46.* Prove the formula

\[
\sum_{j+k=l} \binom{m + j - 1}{j} \binom{n + k - 1}{k} = \binom{m + n + l - 1}{l}
\]

where the sum ranges over all \(j \ge 0\) and \(k \ge 0\) such that \(j + k = l\).
[Hint: this may be more recognizable in the form

\[
\sum_{j+k=l} \binom{-m}{j} \binom{-n}{k} = \binom{-m - n}{l};
\]

cf. (3.3.9). Use \((1 - z)^{-m}(1 - z)^{-n} = (1 - z)^{-m-n}\).]


47.* The general case of the problem of points (Example 6 of §2.2) is as
follows. Two players play a series of independent games in which A has
probability p, B has probability q = 1 − p of winning each game.
Suppose that A needs m and B needs n more games to win the series.
Show that the probability that A will win is given by either one of the
expressions below:

\[
\text{(i)} \quad \sum_{k=m}^{m+n-1} \binom{m + n - 1}{k} p^k q^{m+n-1-k};
\]

\[
\text{(ii)} \quad \sum_{k=0}^{n-1} \binom{m + k - 1}{k} p^m q^k.
\]

The solutions were first given by Montmort (1678-1719). [Hint:
solution (i) follows at once from Bernoulli's formula by an obvious
interpretation. This is based on the idea (see Example 6 of §2.2) to
complete m + n − 1 games even if A wins before the end. Solution (ii)
is based on the more natural idea of terminating the series as soon as
A wins m games before B wins n games. Suppose this happens after
exactly m + k games, then A must win the last game and also m − 1
among the first m + k − 1 games, and \(k \le n - 1\).]

48.* Prove directly that the two expressions (i) and (ii) given in No. 47 are
equal. [Hint: one can do this by induction on n, for fixed m; but a more
interesting method is suggested by comparison of the two ideas involved
in the solutions. This leads to the expansion of (ii) into

\[
\begin{aligned}
\sum_{k=0}^{n-1} \binom{m + k - 1}{k} p^m q^k (p + q)^{n-1-k}
 &= p^m \sum_{k=0}^{n-1} \binom{m + k - 1}{k} q^k
    \sum_{j=0}^{n-1-k} \binom{n - 1 - k}{j} p^{n-1-k-j} q^j \\
 &= \sum_{l=0}^{n-1} p^{m+n-1-l} q^l
    \sum_{j+k=l} \binom{m + k - 1}{k} \binom{n - k - 1}{j};
\end{aligned}
\]

now use No. 46. Note that the equality relates a binomial distribution
to a negative binomial distribution.]


Chapter 7 


Poisson and Normal Distributions 


7.1. Models for Poisson distribution 


The Poisson distribution is of great importance in theory and in practice. 
It has the added virtue of being a simple mathematical object. We could have 
introduced it at an earlier stage in the book, and the reader was alerted to this 
in §4.4. However, the belated entrance will give it more prominence, as well 
as a more thorough discussion than would be possible without the benefit of 
the last two chapters. 

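Fix a real positive number \(a\) and consider the probability distribution
\(\{a_k, k \in N^0\}\), where \(N^0\) is the set of all nonnegative integers, given by

\[
a_k = \frac{e^{-a}}{k!}\, a^k. \tag{7.1.1}
\]

We must first verify that

\[
\sum_{k=0}^{\infty} a_k = e^{-a} \sum_{k=0}^{\infty} \frac{a^k}{k!} = e^{-a} \cdot e^{a} = 1
\]

where we have used the Taylor series of \(e^a\). Let us compute its mean as well:

\[
\sum_{k=0}^{\infty} k a_k = e^{-a} \sum_{k=0}^{\infty} k \frac{a^k}{k!}
 = e^{-a} a \sum_{k=1}^{\infty} \frac{a^{k-1}}{(k - 1)!}
 = e^{-a} a \sum_{k=0}^{\infty} \frac{a^k}{k!} = e^{-a} a e^{a} = a.
\]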

[This little summation has been spelled out since I have found that students
often do not learn such problems of “infinite series” from their calculus
course.] Thus the parameter \(a\) has a very specific meaning indeed. We shall
call the distribution in (7.1.1) the Poisson distribution with parameter \(a\). It
will be denoted by \(\pi(a)\), and the term with subscript \(k\) by \(\pi_k(a)\). Thus
if X is a random variable having this distribution, then

\[
P(X = k) = \pi_k(a) = \frac{e^{-a}}{k!}\, a^k, \quad k \in N^0;
\]

and

\[
E(X) = a. \tag{7.1.2}
\]

Next, let us find the generating function \(g\) as defined in §6.5. We have,
using Taylor's series for \(e^{az}\) this time:

\[
g(z) = \sum_{k=0}^{\infty} a_k z^k = e^{-a} \sum_{k=0}^{\infty} \frac{a^k}{k!} z^k
 = e^{-a} e^{az} = e^{a(z-1)}. \tag{7.1.3}
\]


This is a simple function and can be put to good use in calculations. If we
differentiate it twice, we get

\[
g'(z) = a e^{a(z-1)}, \qquad g''(z) = a^2 e^{a(z-1)}.
\]

Hence by (6.5.6),

\[
\begin{aligned}
E(X) &= g'(1) = a, \\
E(X^2) &= g'(1) + g''(1) = a + a^2, \\
\sigma^2(X) &= a.
\end{aligned}
\tag{7.1.4}
\]


So the variance as well as the mean is equal to the parameter a (see below 
for an explanation). 
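
The three facts just obtained (total probability 1, mean \(a\), variance \(a\)) can be checked numerically by truncating the series. The Python sketch below is an illustration only: the value a = 2.5, the cutoff of 200 terms and the names `poisson_terms` and `summary` are arbitrary choices for this example. Successive terms are generated from the ratio of consecutive terms of (7.1.1), which is \(a/(k+1)\), so that no large factorial is ever formed.

```python
from math import exp


def poisson_terms(a, terms=200):
    """First `terms` probabilities of (7.1.1), built term by term."""
    probs = [exp(-a)]
    for k in range(terms - 1):
        # Ratio of consecutive terms of (7.1.1) is a / (k + 1).
        probs.append(probs[-1] * a / (k + 1))
    return probs


def summary(a, terms=200):
    """Truncated total probability, mean and variance."""
    probs = poisson_terms(a, terms)
    total = sum(probs)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return total, mean, second - mean ** 2


total, mean, variance = summary(2.5)
print(total, mean, variance)
```

Up to rounding, the printed values should be 1, 2.5 and 2.5, in agreement with (7.1.2) and (7.1.4).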

Mathematically, the Poisson distribution can be derived in a number of 
significant ways. One of these is a limiting scheme via the binomial distribu- 
tion. This is known historically as Poisson’s limit law, and will be discussed 
first. Another way, that of adding exponentially distributed random variables, 
is the main topic of the next section. 

Recall the binomial distribution \(B(n; p)\) in §4.4 and write

\[
B_k(n; p) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad 0 \le k \le n. \tag{7.1.5}
\]

We shall allow \(p\) to vary with \(n\); this means only that we put \(p = p_n\) in the
above. Specifically, we take

\[
p_n = \frac{a}{n}, \quad n \ge 1. \tag{7.1.6}
\]

We are therefore considering the sequence of binomial distributions
\(B(n; a/n)\), a typical term of which is given by

\[
B_k\Big(n; \frac{a}{n}\Big) = \binom{n}{k} \Big(\frac{a}{n}\Big)^k \Big(1 - \frac{a}{n}\Big)^{n-k},
\quad 0 \le k \le n. \tag{7.1.7}
\]

For brevity let us denote this by \(b_k(n)\). Now fix \(k\) and let \(n\) go to infinity.
It turns out that \(b_k(n)\) converges for every \(k\), and can be calculated as
follows. To begin at the beginning, take \(k = 0\): then we have


(7.1.8) lim b(n) = lim (1 — “)" = e7%, 
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This is one of the fundamental formulas for the exponential function which 
you ought to remember from calculus. An easy way to see it is to take natural 


logarithm and use the Taylor series log (1 — x) = — > x"/n: 
n=] 


lI 
Pa 
Oo 
g@ 
nr 
—" 
| 
j 
NS 
Il 
Pa 
Ss 
| 
SiR 
| 
we) 
2 | RQ. 
| 
VY 


(7.1.9) log (1 - “)" 


When n— oo the last-written quantity converges to —a which is log e~*. 
Hence (7.1.8) may be verified by taking logarithms and expanding into power 
series, a method very much in use in applied mathematics. A rigorous proof 
must show that the three dots at the end of (7.1.9) above can indeed be over- 
looked; see Exercise 8. 
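
The expansion in (7.1.9) can be observed numerically. The sketch below is an
illustration only (the helper name `log_gap` is hypothetical); it compares
n log(1 - a/n) + a with the leading neglected term -a^2/(2n), assuming a < n.

```python
import math

def log_gap(a, n):
    """n*log(1 - a/n) + a, the part of (7.1.9) that must vanish as n grows."""
    return n * math.log1p(-a / n) + a

for n in (10, 1000, 100000):
    # The leading term of the gap is -a^2/(2n); the two printed values
    # should agree more and more closely as n increases.
    print(n, log_gap(1.0, n), -1.0 / (2 * n))
```

Using `math.log1p` rather than `math.log(1 - a/n)` keeps the computation
accurate when a/n is very small.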

To proceed, we take the ratio of consecutive terms in (7.1.7): 


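$\frac{b_{k+1}(n)}{b_k(n)} = \frac{n-k}{k+1} \cdot \frac{a}{n} \left(1 - \frac{a}{n}\right)^{-1} = \frac{a}{k+1} \left[ \frac{n-k}{n} \left(1 - \frac{a}{n}\right)^{-1} \right].$

The two factors within the square brackets above both converge to 1 as
$n \to \infty$; hence


(7.1.10)  $\lim_{n \to \infty} \frac{b_{k+1}(n)}{b_k(n)} = \frac{a}{k+1}.$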


Starting with (7.1.8), and using (7.1.10) for k = 0, 1, 2,..., we obtain 


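$\lim_{n \to \infty} b_1(n) = \frac{a}{1} \lim_{n \to \infty} b_0(n) = a e^{-a},$

$\lim_{n \to \infty} b_2(n) = \frac{a}{2} \lim_{n \to \infty} b_1(n) = \frac{a^2}{1 \cdot 2} e^{-a},$

$\cdots$

$\lim_{n \to \infty} b_k(n) = \frac{a}{k} \lim_{n \to \infty} b_{k-1}(n) = \frac{a^k}{1 \cdot 2 \cdots k} e^{-a}.$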


These limit values are the successive terms of $\pi(a)$. Therefore we have proved
Poisson’s theorem in its simplest form as follows.
Poisson’s limit law:


$\lim_{n \to \infty} B_k\left(n; \frac{a}{n}\right) = \pi_k(a), \quad k \in N^0.$


This result remains true if the a/n on the left side above is replaced by $a_n/n$,
where $\lim_{n \to \infty} a_n = a$. In other words, instead of taking $p_n = a/n$ as we did in
(7.1.6), so that $np_n = a$, we may take $p_n = a_n/n$, so that $np_n = a_n$ and


(7.1.11)  $\lim_{n \to \infty} np_n = \lim_{n \to \infty} a_n = a.$


The derivation is similar to the above except that (7.1.8) is replaced by the
stronger result below: if $\lim_{n \to \infty} a_n = a$, then


(7.1.12)  $\lim_{n \to \infty} \left(1 - \frac{a_n}{n}\right)^n = e^{-a}.$


With this improvement, we can now enunciate the theorem in a more pragmatic
form as follows. A binomial probability $B_k(n; p)$, when n is large compared
with np which is nearly a, may be approximated by $\pi_k(a)$, for modest
values of k. Recall that np is the mean of B(n; p) (see §4.4); it is no surprise
that its approximate value a should also be the mean of the approximate
Poisson distribution, as we have seen under (7.1.2). Similarly, the variance of
B(n; p) is $npq = n \cdot \frac{a}{n} \left(1 - \frac{a}{n}\right)$ for $p = \frac{a}{n}$; as $n \to \infty$ the limit is also a, as
remarked under (7.1.4).
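
Poisson's limit law is easy to watch in action. The following Python sketch is
an illustration, not part of the original text; the function names and the
choice a = 2 are assumptions. It reports the largest gap between the binomial
term of B(n; a/n) and the corresponding Poisson term over the first few k.

```python
import math

def binomial_pmf(k, n, p):
    """B_k(n; p) as in (7.1.5)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, a):
    """pi_k(a) = e^{-a} a^k / k!."""
    return math.exp(-a) * a ** k / math.factorial(k)

def max_gap(n, a, kmax=10):
    """Largest |B_k(n; a/n) - pi_k(a)| over 0 <= k <= min(kmax, n)."""
    return max(abs(binomial_pmf(k, n, a / n) - poisson_pmf(k, a))
               for k in range(min(kmax, n) + 1))

for n in (10, 100, 1000, 10000):
    print(n, max_gap(n, a=2.0))
```

The printed gaps shrink roughly in proportion to 1/n, in line with the
neglected terms of (7.1.9).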

The mathematical introduction of the Poisson distribution is thus done. 
The limiting passage from the binomial scheme is quite elementary, in con- 
trast to what will be done in §7.3 below. But does the condition (7.1.6), or 
the more relaxed (7.1.11), make sense in any real situation? The astonishing 
thing here is that a great variety of natural and man-made random phe- 
nomena are found to fit the pattern nicely. We give four examples to illus- 
trate ways in which the scheme works to a greater or lesser degree. 


Example 1. Consider a rare event, namely one with small probability p of 
occurrence. For instance if one bets on a single number at roulette, the
probability of winning is equal to 1/37 ≈ .027, assuming that the 36 numbers
and one “zero” are equally likely. [The roulette wheels in Monte Carlo have
a single “zero,” but those in Las Vegas have “double zeros.”] If one does this
37 times, he can “expect” to win once. (Which theorem says this?) But we
can also compute the probabilities that he wins no time, once, twice, etc. The
exact answers are of course given by the first three terms of B(37; 1/37):


$\left(1 - \frac{1}{37}\right)^{37}, \quad \binom{37}{1} \left(1 - \frac{1}{37}\right)^{36} \frac{1}{37}, \quad \binom{37}{2} \left(1 - \frac{1}{37}\right)^{35} \frac{1}{37^2}.$


If we set


$c = \left(1 - \frac{1}{37}\right)^{37} \approx .363,$


then the three numbers above are:


$c, \quad \frac{37}{36} c, \quad \frac{37}{36} \cdot \frac{1}{2} c.$


Hence if we use the approximation $e^{-1} \approx .368$ for c, committing thereby an
error of 1.5%; and furthermore “confound” 37/36 with 1, committing another
error of 3%, but in the opposite direction; we get the first three terms of $\pi(1)$,
namely:


$e^{-1}, \quad e^{-1}, \quad \frac{1}{2} e^{-1}.$
3 3 y) 


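Further errors will be compounded if we go on, but some may balance others.
We may also choose to bet, say, one hundred eleven times (111 = 37 × 3)
on a single number, and vary it from time to time as gamblers usually do at a
roulette table. The same sort of approximation will then yield


$\left(1 - \frac{1}{37}\right)^{111} = c^3 \approx e^{-3},$

$\frac{111}{37} \left(1 - \frac{1}{37}\right)^{110} = \frac{37}{36} \cdot 3c^3 \approx 3e^{-3},$

$\frac{111 \times 110}{2} \cdot \frac{1}{37^2} \left(1 - \frac{1}{37}\right)^{109} = \frac{111 \times 110}{36 \times 36} \cdot \frac{1}{2} c^3 \approx \frac{9}{2} e^{-3},$

etc. Here of course $c^3$ is a worse approximation of $e^{-3}$ than c is of $e^{-1}$. Anyway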
it should be clear that we are simply engaged in more or less crude but handy 
numerical approximations, without going to any limit. For no matter how 
small p is, so long as it is fixed as in this example, np will of course go to 
infinity with n, and the limiting scheme discussed above will be wide of the 
mark when n is large enough. Nevertheless a reasonably good approximation 
can be obtained for values of n and p such that np is relatively small compared
with n. It is just a case of pure and simple numerical approximation,
but many such applications have been made to various rare events. In fact
the Poisson law was very popular at one time under the name of “the law of
small numbers.” Well-kept statistical data such as the number of Prussian
cavalry men killed each year by a kick from a horse, or the number of child 
suicides in Prussia, were cited as typical examples of this remarkable dis- 
tribution (see [Keynes]). 
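
The roulette figures of Example 1 can be reproduced with a few lines of
Python. This is only a sketch for checking the arithmetic; the helper
`roulette_rows` is a hypothetical name, and the single-zero wheel (p = 1/37)
is the assumption made in the example itself.

```python
import math

def binomial_pmf(k, n, p):
    """B_k(n; p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, a):
    """pi_k(a)."""
    return math.exp(-a) * a ** k / math.factorial(k)

def roulette_rows(n, kmax=2):
    """Rows (k, exact binomial value, Poisson value) for n bets at p = 1/37."""
    a = n / 37
    return [(k, binomial_pmf(k, n, 1 / 37), poisson_pmf(k, a))
            for k in range(kmax + 1)]

for n in (37, 111):
    for k, exact, approx in roulette_rows(n):
        print(n, k, round(exact, 4), round(approx, 4))
```

Comparing the rows for n = 37 and n = 111 shows how the quality of the
approximation changes from term to term, as remarked in the text.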
Example 2. Consider the card-matching problem in §6.2. If a person who
claims ESP (extrasensory perception) is a fake and is merely trying to match 
the cards at random, will his average score be better or worse when the num- 
ber of cards is increased? Intuitively, two opposite effects are apparent. On 
one hand, there will be more cards to score; on the other, it will be harder to 
score each. As it turns out (see §6.2) these two effects balance each other so 
nicely that the expected number is equal to 1 irrespective of the number of 
cards! Here is an ideal setting for (7.1.6) with a = 1. In fact, we can make it 
conform exactly to the previous scheme by allowing duplication in the 
guessing. That is, if we think of a deck of n cards laid face down on the 
table, we are allowed to guess them one by one with total forgetfulness. Then 
we can guess each card to be any one of the n cards, with equal probability 
1/n, and independently of all other guesses. The probability of exactly k 
matches is then given by (7.1.7) with a = 1, and so the Poisson approximation
$\pi_k(1)$ applies if n is large.

This kind of matching corresponds to sampling with replacement. It is 
not a realistic model when two decks of cards are matched against each 
other. There is then mutual dependence between the various guesses and the 
binomial distribution above of course does not apply. But it can be shown 
that when n is large the effect of dependence is small, as follows. Let the 
probability of “no match” be $q_n$ when there are n cards to be matched. We
see in Example 4 of §6.2 that


$q_n \approx e^{-1}$


is an excellent approximation even for moderate values of n. Now an easy
combinatorial argument (Exercise 19) shows that the probability of exactly k
matches is equal to


(7.1.13)  $\binom{n}{k} \frac{1}{(n)_k} q_{n-k} = \frac{1}{k!} q_{n-k}.$


Hence for fixed k, this converges to $\frac{1}{k!} e^{-1} = \pi_k(1)$.
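
Formula (7.1.13) can be verified by brute force for a small deck. The sketch
below is an illustration; it assumes the standard inclusion-exclusion
expression for the no-match probability, q_n = sum of (-1)^j / j! for j from 0
to n, which is not displayed in this section, and the function names are
hypothetical.

```python
import math
from fractions import Fraction
from itertools import permutations

def q(n):
    """Probability of no match among n cards (inclusion-exclusion formula)."""
    return sum(Fraction((-1) ** j, math.factorial(j)) for j in range(n + 1))

def match_prob(n, k):
    """Probability of exactly k matches, as in (7.1.13): q_{n-k} / k!."""
    return q(n - k) / math.factorial(k)

def match_prob_brute(n, k):
    """The same probability by enumerating all n! permutations (small n only)."""
    hits = sum(1 for perm in permutations(range(n))
               if sum(i == card for i, card in enumerate(perm)) == k)
    return Fraction(hits, math.factorial(n))

for k in range(4):
    # Exact value for 10 cards next to the Poisson term pi_k(1).
    print(k, float(match_prob(10, k)), math.exp(-1) / math.factorial(k))
```

Exact rational arithmetic (`Fraction`) is used so that the formula and the
enumeration can be compared without rounding.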


Example 3. The Poisson law in a spatial distribution is typified by the count- 
ing of particles in a sort of “homogeneous chaos.” For instance we may 
count the number of virus particles with a square grid under the microscope. 
Suppose that the average number per small square is μ and that there are N
squares in the grid. The virus moves freely about in such a way that its
distribution over the grid may be approximated by the “tokens in boxes” model
described under (I′) in §3.3. Namely, there are μN particles to be placed into
the N squares, and each particle can go into any of the squares with prob- 
ability 1/N, independently of each other. Then the probability of finding ex- 
actly k particles in a given square is given by the binomial distribution: 

$B_k\left(\mu N; \frac{1}{N}\right) = \binom{\mu N}{k} \left(\frac{1}{N}\right)^k \left(1 - \frac{1}{N}\right)^{\mu N - k}.$

Now we should imagine that the virus specimen under examination is part of
a much larger specimen with the same average spatial proportion μ. In
practice, this assumption is reasonably correct when for example a little blood
is drawn from a sick body. It is then legitimate to approximate the above
probability by $\pi_k(\mu)$ when N is large. The point here is that the small squares
in which the counts are made remain fixed in size, but the homogeneity of
space permits a limiting passage when the number of such squares is multiplied.

Figure 26

A grim example of the spatial scheme is furnished by the counting of
flying-bomb hits on the south of London during World War II. The area was
divided into N = 576 squares each of 1/4 square mile, and μ was found
statistically to be about .930. The table below shows the actual counts $N_k$
of squares with k hits and the corresponding Poisson approximations
$N\pi_k(\mu)$ with μ = .9323. The close fit in this case might be explained by the
deliberate randomness of the attacks which justified the binomial model above.


k                  0        1        2        3       4     ≥ 5
N_k              229      211       93       35       7       1
N π_k(μ)      226.74   211.39    98.54    30.62    7.14    1.59
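
The Poisson row of the table can be recomputed directly. This Python sketch
is illustrative only; it assumes that the last column collects all squares
with five or more hits, which is a reading of the table and not stated in the
text, and the variable names are hypothetical.

```python
import math

N = 576                                # number of squares
mu = 0.9323                            # average number of hits per square
observed = [229, 211, 93, 35, 7, 1]    # counts from the table (last cell: assumed ">= 5")

def poisson_pmf(k, a):
    """pi_k(a)."""
    return math.exp(-a) * a ** k / math.factorial(k)

# Expected number of squares with exactly k hits, k = 0, ..., 4.
expected = [N * poisson_pmf(k, mu) for k in range(5)]
# Remaining expected count: squares with five or more hits.
expected.append(N - sum(expected))

for k, (obs, exp) in enumerate(zip(observed, expected)):
    print(k, obs, round(exp, 2))
```

The first five computed values can be compared with the tabulated ones; the
last entry depends on how the final column is defined.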


Example 4. In a large class of applications, time plays the role of space in 
the preceding example. If random occurrences are distributed over a period 
of time in such a way that their number per unit time may be supposed to be 
fairly constant over the period, then the Poisson scheme will operate with 
time acting as the medium for the homogeneous chaos. One could repeat the 
multiplication argument in Example 3 with time substituting for space, but 
here it is perhaps more plausible to subdivide the time. Suppose for example 
some cosmic ray impacts are registered on a geiger counter at the average 
rate of a per second. Then the probability of a register in a small time interval 
δ is given by aδ + o(δ), where the “little-o” term represents an error term
which is of smaller order of magnitude than δ, or roughly, “very small.”
Now divide the time interval [0, t] into N equal parts, so that the probability
of a counter register in each subinterval is


$\frac{at}{N} + o\left(\frac{t}{N}\right)$


with δ = t/N above. Of course at/N is much smaller than 1 when t is fixed
and N is large. Let us first assume that for large enough values of N, the prob- 
ability of more than one register in any small subinterval may be neglected, 
so that we may suppose that the number of impacts received in each of the N 
subintervals is either 0 or 1. These numbers can then be treated as Bernoullian 
random variables taking the values 0 and 1 with probabilities $1 - \frac{at}{N}$ and $\frac{at}{N}$
respectively. Finally we assume that they are independent of each other. This 
assumption can be justified on empirical grounds; for a deeper analysis in 
terms of the Poisson process, see the next section. Under these assumptions 
it is now clear that the probability of receiving exactly k impacts in the entire 


period [0, t] is given by the binomial $B_k\left(N; \frac{at}{N}\right)$; in fact the total number
registered in [0, t] is just the sum of N independent Bernoullian random variables
described above. (See Example 9 of §4.4.) Since N is at our disposal and may
be made arbitrarily large, in the limit we get $\pi_k(at)$. Thus in this case the
validity of the Poisson scheme may be attributed to the infinite subdivisibility 
of time. The basic assumption concerning the independence of actions in 
disjoint subintervals will be justified in Theorem 2 of the following section. 


7.2.* Poisson process 


For a deeper understanding of the Poisson distribution we will construct a 
model in which it takes its proper place. The model is known as the Poisson
process and is a fundamental stochastic process.

Consider a sequence of independent positive random variables all of 
which have the exponential density $ae^{-at}$, a > 0; see Example 3 of §6.2. Let
them be denoted by $T_1, T_2, \ldots$ so that for each j,


(7.2.1)  $P(T_j \le t) = 1 - e^{-at}, \quad P(T_j > t) = e^{-at}, \quad t \ge 0.$

Since they are independent, we have for any nonnegative $t_1, \ldots, t_n$:


$P(T_1 > t_1, \ldots, T_n > t_n) = P(T_1 > t_1) \cdots P(T_n > t_n) = e^{-a(t_1 + \cdots + t_n)}.$


This determines the joint distribution of the $T_j$’s, although we have given the
“tail probabilities” for obvious simplicity. Examples of such random variables
have been discussed before. For instance, they may be the inter-arrival times
between vehicles in a traffic flow, or between claims received by an insurance 
company (see Example 5 of §4.2). They can also be the durations of succes- 
sive telephone calls, or sojourn times of atoms at a specific energy level. 
Since


(7.2.2)  $E(T_j) = \frac{1}{a},$


it is clear that the smaller a is, the longer the average inter-arrival, or waiting, 
or holding time. For instance if T is the inter-arrival time between automobiles 
at a check point, then the corresponding a must be much larger on a Los 
Angeles freeway than in a Nevada desert. In this particular case a is also 
known as the intensity of the flow, in the sense that heavier traffic means a
higher intensity, as every driver knows from his nerves. 

Now let us put $S_0 = 0$ and for n ≥ 1:


(7.2.3)  $S_n = T_1 + \cdots + T_n.$


Then by definition $S_n$ is the waiting time till the nth arrival; and the event
$\{S_n \le t\}$ means that the nth arrival occurs before time t. [We shall use the
preposition “before” loosely to mean “before or at” (time t). The difference
can often be overlooked in continuous time models but must be observed in
discrete time.] Equivalently, this means “the total number of arrivals in the
time interval [0, t]” is at least n. This kind of dual point of view is very useful,
so we will denote the number just introduced by N(t). We can then record
the assertion as follows:


(7.2.4)  $\{N(t) \ge n\} = \{S_n \le t\}.$


Like $S_n$, N(t) is also a random variable: N(t, ω) with the ω omitted from the
notation as in $T_j(\omega)$. If you still remember our general discussion of random
variables as functions of a sample point ω, now is a good time to review the
situation. What is ω here? Just as in the examples of §4.2, each ω may be
regarded as a possible record of the traffic flow or insurance claims or
telephone service or nuclear transition. More precisely, N(t) is determined by
the whole sequence $\{T_j, j \ge 1\}$, and depends on ω through the $T_j$’s. In fact,
taking differences of both sides in the equations (7.2.4) for n and n + 1, we
obtain


(7.2.5)  $\{N(t) = n\} = \{S_n \le t\} - \{S_{n+1} \le t\} = \{S_n \le t < S_{n+1}\}.$


The meaning of this new equation is clear from a direct interpretation: there 
are exactly n arrivals in [0, t] if and only if the nth arrival occurs before t but
the (n + 1)st occurs after t. For each value of t, the probability distribution
of the random variable N(t) is therefore given by


(7.2.6)  $P\{N(t) = n\} = P\{S_n \le t\} - P\{S_{n+1} \le t\}, \quad n \in N^0.$


Observe the use of our convention $S_0 = 0$ in the above. We proceed to show
that this is the Poisson distribution $\pi(at)$.

We shall calculate the probability $P\{S_n \le t\}$ via the Laplace transform
of $S_n$ (see §6.5). The first step is to find the Laplace transform L(λ) of each $T_j$,
which is defined since $T_j \ge 0$. By (6.5.13) with $f(u) = ae^{-au}$, we have


(7.2.7)  $L(\lambda) = \int_0^\infty e^{-\lambda u} a e^{-au} \, du = \frac{a}{a + \lambda}.$

Since the $T_j$’s are independent, an application of Theorem 7 of §6.5 yields the
Laplace transform of $S_n$:


(7.2.8)  $L(\lambda)^n = \left(\frac{a}{a + \lambda}\right)^n.$


To get the distribution or density function of $S_n$ from its Laplace transform
is called an inversion problem; and there are tables of common Laplace 
transforms from which you can look up the inverse, namely the distribution 
or density associated with it. In the present case the answer has been indicated 
in Exercise 38 of Chapter 6. However here is a trick which leads to it quickly. 
The basic formula is 


(7.2.9)  $\int_0^\infty e^{-xt} \, dt = \frac{1}{x}, \quad x > 0.$


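Differentiating both sides n − 1 times, which is easy to do, we obtain


$(-1)^{n-1} \int_0^\infty t^{n-1} e^{-xt} \, dt = \frac{(-1)^{n-1} (n-1)!}{x^n},$


or


(7.2.10)  $\frac{1}{(n-1)!} \int_0^\infty t^{n-1} e^{-xt} \, dt = \frac{1}{x^n}.$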


Substituting a + λ for x in the above and multiplying both sides by $a^n$, we
deduce


$\int_0^\infty \frac{a^n}{(n-1)!} u^{n-1} e^{-au} e^{-\lambda u} \, du = \left(\frac{a}{a + \lambda}\right)^n.$


Thus if we put


(7.2.11)  $f_n(u) = \frac{a^n}{(n-1)!} u^{n-1} e^{-au},$

we see that $f_n$ is the density function for the Laplace transform in (7.2.8),
namely that of $S_n$.† Hence we can rewrite the right side of (7.2.6) explicitly as


(7.2.12)  $\int_0^t f_n(u) \, du - \int_0^t f_{n+1}(u) \, du.$


To simplify this we integrate the first integral by parts as indicated below: 


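$\frac{a^n}{(n-1)!} \int_0^t e^{-au} u^{n-1} \, du = \frac{a^n}{(n-1)!} \left\{ \left[ \frac{u^n}{n} e^{-au} \right]_0^t + \int_0^t \frac{u^n}{n} e^{-au} a \, du \right\}$

$= \frac{a^n}{n!} t^n e^{-at} + \frac{a^{n+1}}{n!} \int_0^t u^n e^{-au} \, du.$


But the last-written integral is just the second integral in (7.2.12); hence the
difference there is precisely $\frac{a^n}{n!} t^n e^{-at} = \pi_n(at)$. For fixed n and a, this is the
density function of the gamma distribution Γ(n; a); see p. 189. Let us record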
this as a theorem. 


Theorem 1. The total number of arrivals in a time interval of length t has the 
Poisson distribution $\pi(at)$, for each t > 0.
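
For the interval [0, t], the theorem can be checked numerically from (7.2.6)
and (7.2.11) without any integration by parts. The sketch below is an
illustration under stated assumptions: the integrals in (7.2.12) are
approximated with a composite Simpson rule, and the function names and the
values a = 1.5, t = 2 are arbitrary choices.

```python
import math

def gamma_density(n, a, u):
    """f_n(u) = a^n u^(n-1) e^{-au} / (n-1)!, the density of S_n in (7.2.11)."""
    return a ** n * u ** (n - 1) * math.exp(-a * u) / math.factorial(n - 1)

def simpson(f, lo, hi, steps=2000):
    """Composite Simpson rule; `steps` must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def prob_n_arrivals(n, a, t):
    """P{N(t) = n} = P{S_n <= t} - P{S_{n+1} <= t} for n >= 1, as in (7.2.6)."""
    cdf_n = simpson(lambda u: gamma_density(n, a, u), 0.0, t)
    cdf_next = simpson(lambda u: gamma_density(n + 1, a, u), 0.0, t)
    return cdf_n - cdf_next

def poisson_pmf(k, a):
    """pi_k(a)."""
    return math.exp(-a) * a ** k / math.factorial(k)

for n in range(1, 6):
    print(n, prob_n_arrivals(n, 1.5, 2.0), poisson_pmf(n, 1.5 * 2.0))
```

The two printed columns should agree to many decimal places, since the
integrands are smooth on [0, t].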


The reader should observe that the theorem asserts more than has been 
proved. For in our formulation above we have implicitly chosen an initial 
instant from which time is measured, namely the zero-time for the first 
arrival time $T_1$. Thus the result was proved only for the total number of
arrivals in the interval [0, t]. Now let us denote the number of arrivals in an
arbitrary time interval [s, s + t] by N(s, s + t). Then it is obvious that


$N(s, s + t) = N(s + t) - N(s)$


in our previous notation, and N(0) = 0. But we have yet to show that the
distribution of N(s, s + t) is the same as that of N(0, t). The question becomes: if
we start counting arrivals from time s on, will the same pattern of flow hold
as from time 0 on? The answer is “yes” but it involves an essential property
of the exponential distribution of the $T_j$’s. Intuitively speaking, if a waiting
time such as $T_k$ is broken somewhere in between, its duration after the break
follows the original exponential distribution regardless of how long it has
already endured before the break. This property is sometimes referred to as
“lack of memory,” and can be put in symbols: for any s ≥ 0 and t ≥ 0,
we have 


(7.2.13)  $P(T > t + s \mid T > s) = P(T > t) = e^{-at};$


see Example 4 of §5.1. There is a converse: if a nonnegative random variable
T satisfies the condition above, then it must have an exponential distribution;
see Exercise 41 of Chapter 5. Thus the lack of memory is characteristic of an
exponential inter-arrival time.

† Another derivation of this is contained in Exercise 42 of Chapter 6.

We can now argue that the pattern of flow from time s on is the same as 
from 0 on. For the given instant s breaks one of the inter-arrival times, say
$T_k$, into two stretches as shown below:


[Figure 27: the time axis starting at 0, divided into the successive inter-arrival
times $T_1, T_2, T_3, \ldots, T_k, T_{k+1}$; the instant s falls inside $T_k$ and splits
it into the two stretches $T_k'$ and $T_k''$.]


According to the above, the second stretch $T_k''$ of the broken $T_k$ has the same
distribution as $T_k$ and it is clearly independent of all the succeeding $T_{k+1}$,
$T_{k+2}, \ldots$. [The clarity is intuitive enough, but a formal proof takes some
doing and is omitted.] Hence the new shifted inter-arrival times from s
onward:


(7.2.14)  $T_k'', T_{k+1}, T_{k+2}, \ldots$


follow the same probability pattern as the original inter-arrival times be- 
ginning at 0: 


(7.2.15)  $T_1, T_2, T_3, \ldots$


Therefore our previous analysis applies to the shifted flow as well as the 
original one. In particular the number of arrivals in [s, s + t] must have the
same distribution as that in [0, t]. This is the assertion of Theorem 1.

The fact that N(s, s + t) has the same distribution for all s is referred to
as the time-homogeneity of the flow. Let us remember that this is shown 
under the assumption that the intensity a is constant for all time. In practice 
such an assumption is tenable only over specified periods of time. For example 
in the case of traffic flow on a given highway, it may be assumed for the rush 
hour or from 2 a.m. to 3 a.m. with different values of a. However, for longer 
periods of time such as one day, an average value of a over 24 hours may be 
used. This may again vary from year to year, even week to week. 

So far we have studied the number of arrivals in one period of time, of 
arbitrary length and origin. For a more complete analysis of the flow we 
must consider several such periods and their mutual dependence. In other 
words, we want to find the joint distribution of 

(7.2.16)  $N(s_1, s_1 + t_1), \ N(s_2, s_2 + t_2), \ N(s_3, s_3 + t_3), \ldots$

etc. The answer is given in the next theorem.


Theorem 2. If the intervals $(s_1, s_1 + t_1), (s_2, s_2 + t_2), \ldots$ are disjoint, then the
random variables in (7.2.16) are independent and have the Poisson distributions
$\pi(at_1), \pi(at_2), \ldots$.


It is reasonable and correct to think that if we know the joint action of N 
over any arbitrary finite set of disjoint time intervals, then we know all about 
it in principle. Hence with Theorem 2 we shall be in full control of the process 
in question. 

The proof of Theorem 2 depends again on the lack-of-memory property 
of the $T_j$’s. We will indicate the main idea here without going into formal
details. Going back to the sequence in (7.2.14), where we put $s = s_2$, we now
make the further observation that all the random variables there are not
only independent of one another, but also of all those which precede s,
namely:


(7.2.17)  $T_1, \ldots, T_{k-1}, T_k'.$


The fact that the two broken stretches $T_k'$ and $T_k''$ are independent is a
consequence of (7.2.13), whereas the independence of all the rest should be
intuitively obvious because they have not been disturbed by the break at s.
[Again, it takes some work to justify the intuition.] Now the “past history”
of the flow up to time s is determined by the sequence in (7.2.17), while its
“future development” after s is determined by the sequence in (7.2.14).
Therefore relative to the “present” s, past and future are independent. In
particular, $N(s_1, s_1 + t_1)$, which is part of the past, must be independent of
$N(s_2, s_2 + t_2), N(s_3, s_3 + t_3), \ldots$, which are all part of the future. Repeating
this argument for $s = s_3, s_4, \ldots$, the assertion of Theorem 2 follows.


Figure 28 


We are now ready to give a general definition for the “flow” we have
been discussing all along. 


Definition of Poisson Process. A family of random variables {X(t)}, indexed 
by the continuous variable t ranging over [0, ∞), is called a Poisson process
with parameter (or mean) a iff it satisfies the following conditions:

(i) X(0) = 0;
(ii) the increments $X(s_i + t_i) - X(s_i)$, over an arbitrary finite set of
disjoint intervals $(s_i, s_i + t_i)$, are independent random variables;
(iii) for each s ≥ 0, t ≥ 0, X(s + t) − X(s) has the Poisson distribution
$\pi(at)$.


According to Theorems 1 and 2 above, the family {N(t), t ≥ 0} satisfies
these conditions and therefore forms a Poisson process. Conversely, it can
be shown that every Poisson process is representable as the N(t) above.

The concept of a stochastic process has already been mentioned in 
§§5.3-5.4, in connection with Pólya’s urn model. The sequence $\{X_n, n \ge 1\}$
in Theorem 5 of §5.4 may well be called a Pólya process. In principle a
stochastic process is just any family of random variables; but this is putting
matters in an esoteric way. What is involved here goes back to the
foundations of probability theory discussed in Chapters 2, 4 and 5. There is a sample
space Ω with points ω, a probability measure P defined for certain sets of ω,
a family of functions $\omega \to X_t(\omega)$ called random variables, and the process is
concerned with the joint action or behavior of this family: the marginal and 
joint distributions, the conditional probabilities, the expectations, and so 
forth. Everything we have discussed (and are going to discuss) may be re- 
garded as questions in stochastic processes, for in its full generality the term 
encompasses any random variable or sample set (via its indicator). But in its 
customary usage we mean a rather numerous and well organized family 
governed by significant and useful laws. The preceding characterization of 
a Poisson process is a good example of this description. 

As defined, $\omega \to N(t, \omega)$ is a random variable for each t, with the Poisson
distribution $\pi(at)$. There is a dual point of view which is equally important
in the study of a process, and that is the function $t \to N(t, \omega)$ for each ω.
Such a function is called a sample function (path or trajectory). For example
in the case of telephone calls, to choose a sample point ω may mean to pick
a day’s record of the actual counts at a switch-board over a 24-hour period.
This of course varies from day to day so the function $t \to N(t, \omega)$ gives only
a sample (denoted by ω) of the telephone service. Its graph may look like this:
the points of jumps are the successive arrival times $S_n(\omega)$, each jump being
equal to 1, and the horizontal stretches indicate the inter-arrival times. So 
the sample function is a monotonically nondecreasing function which in- 
creases only by jumps of size one and is flat between jumps. Such a graph is 
typical of the sample function of a Poisson process. If the flow is intense then 
the points of jumps are crowded together. 
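
A sample function of this kind is easy to generate. The following Python
sketch is an illustration, not the author's construction in code form: it
assumes exponential inter-arrival times drawn with the standard library's
`random.Random.expovariate`, and the names `arrival_times` and `count` are
hypothetical. The function `count` implements the duality (7.2.4): N(t) is the
number of arrival times not exceeding t.

```python
import bisect
import random

def arrival_times(a, horizon, rng):
    """Partial sums S_1 < S_2 < ... of exponential(a) gaps, kept up to `horizon`."""
    times, s = [], 0.0
    while True:
        s += rng.expovariate(a)
        if s > horizon:
            return times
        times.append(s)

def count(times, t):
    """N(t): the number of arrival times S_n with S_n <= t (times must be sorted)."""
    return bisect.bisect_right(times, t)

path = arrival_times(2.0, 5.0, random.Random(1))
# Values of one sample function at a few instants; nondecreasing by construction.
print([count(path, t) for t in (0, 1, 2, 3, 4, 5)])
```

Because the arrival times are stored in increasing order, each evaluation of
N(t) is a binary search rather than a scan of the whole record.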

The sequence $\{S_n, n \ge 1\}$ defined in (7.2.3) is also a stochastic process,
indexed by n. A sample function $n \to S_n(\omega)$ for this process is an increasing
sequence of positive numbers $\{S_1(\omega), S_2(\omega), \ldots, S_n(\omega), \ldots\}$. Hence it is
often called a sample sequence. There is a reciprocal relation between this
and the sample function shown above. If we interchange the two coordinate
axes, which can be done by turning the page 90°, and look at Figure 29
through the light from the back, we get the graph of $n \to S_n(\omega)$. Ignore the
now vertical stretches except the lower end points which indicate the values
of $S_n$.


[Figure 29: a sample function of the Poisson process, a step graph with jumps
at the arrival times $S_1, S_2, S_3, S_4, S_5$ marked on the time axis.]


The following examples illustrate some of the properties of the Poisson 
distribution and process. 


Example 5. Consider the number of arrivals in two disjoint time intervals: 
$X_1 = N(s_1, s_1 + t_1)$ and $X_2 = N(s_2, s_2 + t_2)$ as in (7.2.16). What is the
probability that the total number $X_1 + X_2$ is equal to n?

By Theorem 2, $X_1$ and $X_2$ are independent random variables with the
distributions $\pi(at_1)$ and $\pi(at_2)$ respectively. Hence


$P(X_1 + X_2 = n) = \sum_{j+k=n} P(X_1 = j) P(X_2 = k)$

$= \sum_{j+k=n} \frac{e^{-at_1} (at_1)^j}{j!} \cdot \frac{e^{-at_2} (at_2)^k}{k!}$

$= \frac{e^{-a(t_1 + t_2)}}{n!} \sum_{j=0}^{n} \binom{n}{j} (at_1)^j (at_2)^{n-j}$

$= \frac{e^{-a(t_1 + t_2)}}{n!} (at_1 + at_2)^n = \pi_n(at_1 + at_2).$


Namely, $X_1 + X_2$ is also Poissonian with parameter $at_1 + at_2$. The general
proposition is as follows. 


Theorem 3. Let $X_j$ be independent random variables with Poisson
distributions $\pi(a_j)$, 1 ≤ j ≤ n. Then $X_1 + \cdots + X_n$ has the Poisson distribution
$\pi(a_1 + \cdots + a_n)$.


This follows from an easy induction, but we can also make speedy use of 
generating functions. If we denote the generating function of $X_j$ by $g_{X_j}$, then
by Theorem 6 of §6.5:


$g_{X_1 + \cdots + X_n}(z) = g_{X_1}(z) g_{X_2}(z) \cdots g_{X_n}(z)$

$= e^{a_1(z-1)} e^{a_2(z-1)} \cdots e^{a_n(z-1)}$

$= e^{(a_1 + \cdots + a_n)(z-1)}.$


Thus $X_1 + \cdots + X_n$ has the generating function associated with
$\pi(a_1 + \cdots + a_n)$, and so by the uniqueness stated in Theorem 7 of §6.5 it
has the latter as distribution.
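
The case n = 2 of Theorem 3 can be confirmed by carrying out the convolution
sum of Example 5 numerically. This is a small illustrative sketch; the
function names and the parameters 1.2 and 0.7 are arbitrary assumptions.

```python
import math

def poisson_pmf(k, a):
    """pi_k(a)."""
    return math.exp(-a) * a ** k / math.factorial(k)

def sum_pmf(n, a1, a2):
    """P(X1 + X2 = n) by the convolution sum over j + k = n."""
    return sum(poisson_pmf(j, a1) * poisson_pmf(n - j, a2) for j in range(n + 1))

for n in range(5):
    # Convolution of pi(1.2) and pi(0.7) next to the single term pi_n(1.9).
    print(n, sum_pmf(n, 1.2, 0.7), poisson_pmf(n, 1.2 + 0.7))
```

Repeating the convolution step n − 1 times is exactly the induction mentioned
in the text.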


Example 6. At a crossroad of America we watch cars zooming by bearing 
license plates of various states. Assume that the arrival process is Poissonian 
with intensity a, and that the probabilities of each car being from the states 
of California, Nevada and Arizona, are respectively $p_1 = 1/25$, $p_2 = 1/100$,
$p_3 = 1/80$. In a unit period of time what are the numbers of cars counted with
these license plates?

We are assuming that if n cars are counted the distribution of various
license plates follows a multinomial distribution $M(n; 50; p_1, \ldots, p_{50})$ where
the first three p’s are given. Now the number of cars passing in the period of
time is a random variable N such that


$P(N = n) = \frac{e^{-a}}{n!} a^n, \quad n = 0, 1, 2, \ldots.$


Among these N cars, the number bearing the kth state license is also a random
variable $N_k$; of course


$N_1 + N_2 + \cdots + N_{50} = N.$

The problem is to compute

$P(N_1 = n_1, N_2 = n_2, N_3 = n_3)$

for arbitrary $n_1, n_2, n_3$. Let $q = 1 - p_1 - p_2 - p_3$; this is the probability of a
license plate not being from one of the three indicated states. For a given n,
the conditional probability under the hypothesis that N = n is given by the
multinomial; hence


$P(N_1 = n_1, N_2 = n_2, N_3 = n_3 \mid N = n) = \frac{n! \, p_1^{n_1} p_2^{n_2} p_3^{n_3} q^k}{n_1! \, n_2! \, n_3! \, k!}$


where $k = n - n_1 - n_2 - n_3$. Using the formula for “total probability”
(5.2.3), we get 
$P(N_1 = n_1, N_2 = n_2, N_3 = n_3)$

$= \sum_{n=0}^{\infty} P(N = n) P(N_1 = n_1, N_2 = n_2, N_3 = n_3 \mid N = n)$

$= \sum_{n \ge n_1 + n_2 + n_3} \frac{e^{-a}}{n!} a^n \cdot \frac{n! \, p_1^{n_1} p_2^{n_2} p_3^{n_3} q^k}{n_1! \, n_2! \, n_3! \, k!}.$


Since $n_1 + n_2 + n_3$ is fixed and $n \ge n_1 + n_2 + n_3$, the summation above reduces
to that with k ranging over all nonnegative integers. Now write in the above


$e^{-a} = e^{-a(p_1 + p_2 + p_3)} e^{-aq}, \quad a^n = a^{n_1 + n_2 + n_3} a^k$


and take out the factors which do not involve the index of summation k.
The result is


$\frac{e^{-a(p_1 + p_2 + p_3)}}{n_1! \, n_2! \, n_3!} (ap_1)^{n_1} (ap_2)^{n_2} (ap_3)^{n_3} \sum_{k=0}^{\infty} \frac{e^{-aq}}{k!} (aq)^k = \pi_{n_1}(ap_1) \pi_{n_2}(ap_2) \pi_{n_3}(ap_3)$

since the last-written sum equals one. Thus the random variables $N_1$,
$N_2$, $N_3$ are independent (why?) and have the Poisson distributions $\pi(ap_1)$,
$\pi(ap_2)$, $\pi(ap_3)$.

The substitution of the “fixed number n” in the multinomial
$M(n; r; p_1, \ldots, p_r)$ by the “random number N” having a Poisson distribution
is called in statistical methodology “randomized sampling.” In the example 
here the difference is illustrated by either counting a fixed number of cars, or 
counting whatever number of cars in a chosen time interval, or by some other 
selection method which allows a chance variation of the number. Which way 
of counting is more appropriate will in general depend on the circumstances 
and the information sought. 

There is of course a general proposition behind the example above which 
may be stated as follows. 


Theorem 4. Under randomized sampling from a multinomial population
$M(n; r; p_1, \ldots, p_r)$ where the total number sampled is a Poisson random
variable N with mean a, the numbers $N_1, \ldots, N_r$ of the various varieties obtained by
the sampling become independent Poisson variables with means $ap_1, \ldots, ap_r$.
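
The computation of Example 6 can be replayed numerically. The sketch below is
an illustration under stated assumptions: the intensity a = 50 and the counts
(2, 1, 0) are arbitrary choices, the infinite sum over k is truncated at
`kmax`, and the function names are hypothetical. Logarithms are used so that
the large factorials do not overflow.

```python
import math

def log_poisson(k, a):
    """log pi_k(a) = -a + k log a - log k!."""
    return -a + k * math.log(a) - math.lgamma(k + 1)

def joint_by_total_probability(counts, probs, a, kmax=400):
    """Sum over k of P(N = n) times the multinomial term, with n = sum(counts) + k."""
    q = 1.0 - sum(probs)
    m = sum(counts)
    # Part of the multinomial term that does not depend on k.
    log_fixed = sum(c * math.log(p) - math.lgamma(c + 1)
                    for c, p in zip(counts, probs))
    total = 0.0
    for k in range(kmax):
        n = m + k
        log_multinomial = (math.lgamma(n + 1) + log_fixed
                           + k * math.log(q) - math.lgamma(k + 1))
        total += math.exp(log_poisson(n, a) + log_multinomial)
    return total

def product_of_poissons(counts, probs, a):
    """pi_{n_1}(a p_1) * pi_{n_2}(a p_2) * ... as asserted by Theorem 4."""
    return math.prod(math.exp(log_poisson(c, a * p)) for c, p in zip(counts, probs))

probs = (1 / 25, 1 / 100, 1 / 80)   # California, Nevada, Arizona, as in Example 6
print(joint_by_total_probability((2, 1, 0), probs, 50.0))
print(product_of_poissons((2, 1, 0), probs, 50.0))
```

The two printed numbers should coincide up to rounding, which is the content
of the identity derived in Example 6.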


As an illustration of the finer structure of the Poisson process we will 
derive a result concerning the location of the jumps of its sample functions. 
Let us begin with the remark that although (almost) all sample functions 
have infinitely many jumps in (0, ∞), the probability that a jump occurs at
any prescribed instant of time is equal to zero. For if t > 0 is fixed, then as
δ ↓ 0 we have


$P\{N(t + \delta) - N(t - \delta) \ge 1\} = 1 - \pi_0(2a\delta) = 1 - e^{-2a\delta} \to 0.$


In particular, the number of jumps in an interval $(t_1, t_2)$ has the same
distribution whether the end-points $t_1$ and $t_2$ are included or not. As before let
us write $N(t_1, t_2)$ for this number. Now suppose N(0, t) = n for a given t,
where n ≥ 1, and consider an arbitrary subinterval $(t_1, t_2)$ of (0, t). We have
for 0 ≤ j ≤ n:


$P\{N(t_1, t_2) = j; N(0, t) = n\} = P\{N(t_1, t_2) = j; N(0, t_1) + N(t_2, t) = n - j\}.$


Let $t_2 - t_1 = s$; then the sum of the lengths of the two intervals $(0, t_1)$ and
$(t_2, t)$ is equal to t − s. By property (ii) and Theorem 3 the random variable
$N(0, t_1) + N(t_2, t)$ has the distribution $\pi(a(t - s))$ and is independent of
$N(t_1, t_2)$. Hence the probability above is equal to


$\frac{e^{-as} (as)^j}{j!} \cdot \frac{e^{-a(t-s)} (at - as)^{n-j}}{(n-j)!}.$


Dividing this by $P\{N(0, t) = n\} = e^{-at} \frac{(at)^n}{n!}$, we obtain the conditional
probability:


(7.2.18)  $P\{N(t_1, t_2) = j \mid N(0, t) = n\} = \binom{n}{j} \left(\frac{s}{t}\right)^j \left(1 - \frac{s}{t}\right)^{n-j}.$

This is the binomial probability $B_j(n; s/t)$.
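
A quick numerical check of (7.2.18) follows. It is a sketch only, with
hypothetical function names and arbitrary values of s, t and n; it forms the
quotient of Poisson probabilities described above and compares it with the
binomial term, for two different intensities a.

```python
import math

def poisson_pmf(k, a):
    """pi_k(a)."""
    return math.exp(-a) * a ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    """B_k(n; p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def conditional_prob(j, n, a, s, t):
    """P{N(t1, t2) = j | N(0, t) = n} for a subinterval of length s inside (0, t)."""
    joint = poisson_pmf(j, a * s) * poisson_pmf(n - j, a * (t - s))
    return joint / poisson_pmf(n, a * t)

for a in (0.3, 3.0):
    for j in range(6):
        print(a, j, conditional_prob(j, 5, a, 0.4, 2.0), binomial_pmf(j, 5, 0.4 / 2.0))
```

The intensity a cancels in the quotient, so the conditional probabilities are
the same for both values of a.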

Now consider an arbitrary partition of (0, t) into a finite number of
subintervals $I_1, \ldots, I_l$ of lengths $s_1, \ldots, s_l$ so that $s_1 + \cdots + s_l = t$. Let
$n_1, \ldots, n_l$ be arbitrary nonnegative integers with $n_1 + \cdots + n_l = n$. If we
denote by $N(I_k)$ the number of jumps of the process in the interval $I_k$, then
we have by a calculation similar to the above:


(7.2.19)  $P\{N(I_k) = n_k, 1 \le k \le l \mid N(0, t) = n\} = \prod_{k=1}^{l} \frac{e^{-as_k} (as_k)^{n_k}}{n_k!} \left( \frac{e^{-at} (at)^n}{n!} \right)^{-1} = \frac{n!}{n_1! \cdots n_l!} \prod_{k=1}^{l} \left(\frac{s_k}{t}\right)^{n_k}.$


This is the multinomial distribution discussed in §6.4. 

Let us pick $n$ points at random in $(0, t)$ and arrange them in nondecreasing
order $0 \le \xi_1 \le \xi_2 \le \cdots \le \xi_n \le t$. Using the notation above let $\tilde N(I_k)$ denote
the number of these points lying in $I_k$. It is not hard to see that the $n$-dimensional
distribution of $(\xi_1, \ldots, \xi_n)$ is uniquely determined by the distribution of
$(\tilde N(I_1), \ldots, \tilde N(I_l))$ for all possible partitions of $(0, t)$; for a rigorous proof of
this fact see Exercise 26 below. In particular, if the $n$ points are picked
independently of one another and each is uniformly distributed in $(0, t)$, then it
follows from the discussion in §6.4 that the probability $P\{\tilde N(I_k) = n_k,\ 1 \le k \le l\}$
is given by the last term in (7.2.19). Therefore, under the hypothesis that
there are exactly $n$ jumps of the Poisson process in $(0, t)$, the conditional
distribution of the $n$ points of jump is the same as if they are picked in the manner
just described. This has been described as a sort of “homogeneous chaos.”


7.3. From binomial to normal 


From the point of view of approximating the binomial distribution B(n; p) for 
large values of n, the case discussed in §7.1 leading to the Poisson distribution 
is abnormal, because p has to be so small that np remains constant, or nearly 
so. The fact that many random phenomena follow this law rather nicely was 
not known in the early history of probability. One must remember that not 
only radioactivity had yet to be discovered, but neither the telephone nor 
automobile traffic existed as modern problems. On the other hand counting 
heads by tossing coins or points by rolling dice, and the measurement of all 
kinds of physical and biological quantities were already done extensively. 
These led to binomial and multinomial distributions, and since computing 
machines were not available it became imperative to find manageable for- 
mulas for the probabilities. The normal way to approximate the binomial 
distribution: 


(7.3.1)  $B_k(n; p) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad 0 \le k \le n$


is for a fixed value of p and large values of n. To illustrate by the simplest 
kind of example, suppose an unbiased coin is tossed 100 times; what is the 
probability of obtaining exactly 50 heads? The answer 


    $\binom{100}{50} \frac{1}{2^{100}} = \frac{100!}{50!\,50!} \frac{1}{2^{100}}$

gives little satisfaction as we have no idea of the magnitude of this probability. 
Without some advanced mathematics (which will be developed presently),
who can guess whether this is near 1/2 or 1/10 or 1/50?

Now it is evident that the key to such combinatorial formulas is the
factorial $n!$ which just crops up everywhere. Take a look back at Chapter 3. So
the problem is to find a handy formula: another function $\chi(n)$ of $n$ which is a
good approximation for $n!$ but of a simpler structure for computations. But
what is “good”? Since $n!$ increases very rapidly with $n$ (see the short table
in §3.2) it would be hopeless to make the difference $|n! - \chi(n)|$ small. [Does
it really make a difference to have a million dollars or a million and three?]
What counts is the ratio $n!/\chi(n)$ which should be close to 1. For two positive
functions $\psi$ and $\chi$ of the integer variable $n$, there is a standard notation

(7.3.2)  $\psi(n) \sim \chi(n)$ which means $\lim_{n \to \infty} \frac{\psi(n)}{\chi(n)} = 1$.


We say also that $\psi(n)$ and $\chi(n)$ are asymptotically equal (or equivalent) as
$n \to \infty$. If so we have also


    $\lim_{n \to \infty} \frac{|\psi(n) - \chi(n)|}{\chi(n)} = 0$


provided $\chi(n) > 0$ for large $n$; thus the difference $|\psi(n) - \chi(n)|$ is negligible
in comparison with $\chi(n)$ or $\psi(n)$, though it may be large indeed in absolute
terms. Here is a trivial example which you should have retained from a
calculus course (under the misleading heading “indeterminate form”):


    $\psi(n) = 2n^2 + 10n - 100, \quad \chi(n) = 2n^2.$
More generally, a polynomial in $n$ is asymptotically equal to its highest term.
Here, of course, we are dealing with something far more difficult: to find a
simple enough $\chi(n)$ such that


    $\lim_{n \to \infty} \frac{n!}{\chi(n)} = 1.$


Such a $\chi$ is given by Stirling’s formula (see Appendix 2):

(7.3.3)  $\chi(n) = \left(\frac{n}{e}\right)^n \sqrt{2\pi n} = n^{n + 1/2}\, e^{-n} \sqrt{2\pi}$


or more precisely


(7.3.4)  $n! = \left(\frac{n}{e}\right)^n \sqrt{2\pi n}\; e^{\omega(n)}$, where $\frac{1}{12(n + \frac{1}{2})} < \omega(n) < \frac{1}{12n}$.
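A short numerical experiment makes (7.3.3) and (7.3.4) concrete. The following is a minimal sketch, not part of the text; the function name `stirling` and the sample values of $n$ are arbitrary choices. It prints the ratio $n!/\chi(n)$, which should approach 1, and checks that $\omega(n) = \log(n!/\chi(n))$ lies between the two bounds in (7.3.4).

```python
import math

def stirling(n):
    """Stirling's approximation chi(n) = (n/e)^n * sqrt(2*pi*n)."""
    return (n / math.e) ** n * math.sqrt(2 * math.pi * n)

for n in (1, 5, 10, 50, 100):
    ratio = math.factorial(n) / stirling(n)   # should tend to 1
    omega = math.log(ratio)                   # the exponent omega(n) in (7.3.4)
    within_bounds = 1 / (12 * n + 6) < omega < 1 / (12 * n)
    print(n, ratio, omega, within_bounds)
```

Note that the difference $n! - \chi(n)$ grows without bound even as the ratio settles toward 1, which is exactly the distinction drawn above.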


You may think $\chi(n)$ is uglier looking than $n!$, but it is much easier to compute
because powers are easy to compute. Here we will apply it at once to the
little problem above. It does not pay to get involved in numericals at the
beginning so we will consider


(7.3.5)  $\binom{2n}{n} \frac{1}{2^{2n}} = \frac{(2n)!}{n!\,n!} \frac{1}{2^{2n}}.$


Substituting $\chi(n)$ and $\chi(2n)$ for $n!$ and $(2n)!$ respectively, we see that this is
asymptotically equal to

    $\frac{\left(\frac{2n}{e}\right)^{2n} \sqrt{4\pi n}}{\left(\frac{n}{e}\right)^{2n} 2\pi n} \cdot \frac{1}{2^{2n}} = \frac{1}{\sqrt{\pi n}}.$

In particular for $n = 50$ we get the desired answer $1/\sqrt{50\pi} = .08$ approximately.
Try to do this by using logarithms on


    $\frac{(100)_{50}}{50!} \frac{1}{2^{100}}$


and you will appreciate Stirling’s formula more. We proceed at once to the 
slightly more general 


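(7.3.6)  $\binom{2n}{n + k} \frac{1}{2^{2n}} = \frac{(2n)!}{(n + k)!\,(n - k)!} \frac{1}{2^{2n}}$


where $k$ is fixed. A similar application of (7.3.3) yields


    $\frac{\left(\frac{2n}{e}\right)^{2n} \sqrt{4\pi n}}{\left(\frac{n+k}{e}\right)^{n+k} \sqrt{2\pi(n+k)}\, \left(\frac{n-k}{e}\right)^{n-k} \sqrt{2\pi(n-k)}} \cdot \frac{1}{2^{2n}}$

    $= \left(\frac{n}{n+k}\right)^{n+k} \left(\frac{n}{n-k}\right)^{n-k} \sqrt{\frac{n}{\pi (n+k)(n-k)}}.$


Clearly the last-written factor is asymptotically equal to $1/\sqrt{\pi n}$. As for the
two preceding ones, it follows from (7.1.8) that


(7.3.7)  $\lim_{n \to \infty} \left(\frac{n}{n+k}\right)^{n+k} = \lim_{n \to \infty} \left(1 - \frac{k}{n+k}\right)^{n+k} = e^{-k},$

         $\lim_{n \to \infty} \left(\frac{n}{n-k}\right)^{n-k} = \lim_{n \to \infty} \left(1 + \frac{k}{n-k}\right)^{n-k} = e^{k}.$


Therefore the asymptotic value of (7.3.6) is $e^{-k} e^{k} / \sqrt{\pi n} = 1/\sqrt{\pi n}$,
exactly as in (7.3.5) which is the particular case $k = 0$.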
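The asymptotic value $1/\sqrt{\pi n}$ is easy to test against exact integer arithmetic. The sketch below is an illustration only; the name `central_term` and the sample values of $n$ and $k$ are assumptions made here. It evaluates (7.3.6) exactly and multiplies by $\sqrt{\pi n}$, so the printed numbers should drift toward 1 as $n$ grows with $k$ held fixed.

```python
import math

def central_term(n, k):
    """Exact value of C(2n, n+k) / 2^(2n), the quantity in (7.3.6)."""
    return math.comb(2 * n, n + k) / 4**n

# The coin-tossing question: exactly 50 heads in 100 tosses.
print(central_term(50, 0), 1 / math.sqrt(50 * math.pi))

# Ratio of the exact term to its asymptotic value 1/sqrt(pi*n), k fixed.
for n in (50, 500, 5000):
    for k in (0, 3):
        print(n, k, central_term(n, k) * math.sqrt(math.pi * n))
```

Python integers have arbitrary precision, so the binomial coefficient is computed exactly and only the final division is rounded.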
As a consequence, for any fixed number $l$, we have

(7.3.8)  $\lim_{n \to \infty} \sum_{k=-l}^{l} \binom{2n}{n + k} \frac{1}{2^{2n}} = 0$


because each term in the sum has limit 0 as $n \to \infty$ as just shown, and there
are only a fixed number of terms. Now if we remember Pascal’s triangle
(3.3.5), the binomial coefficients $\binom{2n}{n + k}$, $-n \le k \le n$, assume their maximum
value $\binom{2n}{n}$ for the middle term $k = 0$ and decrease as $|k|$ increases
(see Exercise 6 below). According to (7.3.8), the sum of a fixed number of
terms centered around the middle term approaches zero, hence a fortiori the
sum of any fixed number of terms will also approach zero, namely for any
fixed $a$ and $b$ with $a < b$, we have


    $\lim_{n \to \infty} \sum_{j=a}^{b} \binom{2n}{j} \frac{1}{2^{2n}} = 0.$


Finally this result remains true if we replace $2n$ by $2n + 1$ above, because the
ratio of corresponding terms


    $\binom{2n+1}{j} \frac{1}{2^{2n+1}} \Big/ \binom{2n}{j} \frac{1}{2^{2n}} = \frac{2n + 1}{2n + 1 - j} \cdot \frac{1}{2}$


approaches 1/2 which does not affect the zero limit. Now let us return to the
probability meaning of the terms, and denote as usual by $S_n$ the number of
heads obtained in $n$ tosses of the coin. The result then asserts that for any
fixed numbers $a$ and $b$, we have


(7.3.9)  $\lim_{n \to \infty} P(a \le S_n \le b) = 0.$

Observe that there are $n + 1$ possible values for $S_n$, whereas if the range
$[a, b]$ is fixed irrespective of $n$, it will constitute a negligible fraction of $n$
when $n$ is large. Thus the result (7.3.9) is hardly surprising, though certainly
disappointing.

It is clear that in order to “catch” a sufficient number of possible values
of $S_n$ to yield a non-zero limit probability, the range allowed must increase
to infinity with $n$. Since we saw that the terms near the middle are of the order
of magnitude $1/\sqrt{n}$, it is plausible that the number of terms needed will be
of the order of magnitude $\sqrt{n}$. More precisely, it turns out that for each $l$,


(7.3.10)  $P\left(\frac{n}{2} - l\sqrt{n} \le S_n \le \frac{n}{2} + l\sqrt{n}\right) = \sum_{|j - n/2| \le l\sqrt{n}} \binom{n}{j} \frac{1}{2^n}$


will have a limit strictly between 0 and 1 as $n \to \infty$. Here the range for $S_n$ is
centered around the middle value $n/2$ and contains about $2l\sqrt{n}$ terms. When
$n$ is large this is still only a very small fraction of $n$, but it increases just
rapidly enough to serve our purpose. The choice of $\sqrt{n}$ rather than say $n^{1/3}$
or $n^{3/5}$ is crucial and is determined by a rather deep bit of mathematical
analysis which we proceed to explain.
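The contrast between a fixed range and a range growing like $\sqrt{n}$ can be seen directly by exact summation. This is a rough sketch under assumed choices: the name `window_prob`, the fixed half-width 5, the value $l = 1$ and the particular values of $n$ are all illustrative and not from the text.

```python
import math

def window_prob(n, half_width):
    """P(|S_n - n/2| <= half_width) for n tosses of a fair coin, summed exactly."""
    lo = max(0, math.ceil(n / 2 - half_width))
    hi = min(n, math.floor(n / 2 + half_width))
    return sum(math.comb(n, j) for j in range(lo, hi + 1)) / 2**n

for n in (100, 1600, 25600):
    fixed = window_prob(n, 5)                # range fixed irrespective of n
    growing = window_prob(n, math.sqrt(n))   # range l*sqrt(n) with l = 1
    print(n, fixed, growing)
```

The first column should shrink toward 0 in line with (7.3.9), while the second should settle at a value strictly between 0 and 1 as asserted for (7.3.10).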

Up to here we have considered the case $p = 1/2$ in (7.3.1), in order to
bring out the essential features in the simplest case. However, this
simplification would obscure the role of $np$ and $npq$ in the general formula below. The
reader is advised to carry out the following calculations by himself in the
easier case $p = q = 1/2$ to obtain some practice and confidence in such
calculations.


Theorem 5. Suppose $0 < p < 1$; put $q = 1 - p$, and


(7.3.11)  $x_{nk} = \frac{k - np}{\sqrt{npq}}, \quad 0 \le k \le n.$


Clearly $x_{nk}$ depends on both $n$ and $k$, but it will be written as $x_k$ below.


Let $A$ be an arbitrary but fixed positive constant. Then in the range of $k$
such that


(7.3.12)  $|x_k| \le A,$

we have

(7.3.13)  $\binom{n}{k} p^k q^{n-k} \sim \frac{1}{\sqrt{2\pi npq}}\, e^{-x_k^2/2}.$


The convergence is uniform with respect to $k$ in the range specified above.


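Proof: We have from (7.3.11),

(7.3.14)  $k = np + \sqrt{npq}\, x_k, \quad n - k = nq - \sqrt{npq}\, x_k.$

Hence in the range indicated in (7.3.12),

(7.3.15)  $k \sim np, \quad n - k \sim nq.$


Using Stirling’s formula (7.3.3) we may write the left member of (7.3.13) as


(7.3.16)  $\frac{\left(\frac{n}{e}\right)^n \sqrt{2\pi n}\; p^k q^{n-k}}{\left(\frac{k}{e}\right)^k \sqrt{2\pi k}\, \left(\frac{n-k}{e}\right)^{n-k} \sqrt{2\pi(n - k)}}$

          $= \sqrt{\frac{n}{2\pi k(n - k)}}\; \varphi(n, k) \sim \frac{1}{\sqrt{2\pi npq}}\, \varphi(n, k)$


by (7.3.15), where

    $\varphi(n, k) = \left(\frac{np}{k}\right)^k \left(\frac{nq}{n - k}\right)^{n-k}.$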
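Taking logarithms and using the Taylor series

    $\log(1 + x) = x - \frac{x^2}{2} + \cdots + (-1)^{n-1} \frac{x^n}{n} + \cdots, \quad |x| < 1;$

we have by (7.3.14)

(7.3.17)  $\log \left(\frac{np}{k}\right)^k = k \log\left(1 - \frac{\sqrt{npq}\, x_k}{k}\right) = k \left(-\frac{\sqrt{npq}\, x_k}{k} - \frac{npq\, x_k^2}{2k^2} - \cdots\right),$

          $\log \left(\frac{nq}{n - k}\right)^{n-k} = (n - k) \log\left(1 + \frac{\sqrt{npq}\, x_k}{n - k}\right) = (n - k) \left(\frac{\sqrt{npq}\, x_k}{n - k} - \frac{npq\, x_k^2}{2(n - k)^2} + \cdots\right),$


provided that


(7.3.17')  $\left|\frac{\sqrt{npq}\, x_k}{k}\right| < 1$ and $\left|\frac{\sqrt{npq}\, x_k}{n - k}\right| < 1.$


These conditions are satisfied for sufficiently large value of $n$, in view of
(7.3.12) and (7.3.15). Adding the two series expansions above whereupon
the first terms cancel out each other obligingly, ignoring the dots but using
“$\sim$” instead of “$=$,” we obtain


    $\log \varphi(n, k) \sim -\frac{npq\, x_k^2}{2k} - \frac{npq\, x_k^2}{2(n - k)} = -\frac{n^2 pq\, x_k^2}{2k(n - k)}.$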

In Appendix 2 we will give a rigorous demonstration of this relation. Using
(7.3.15) again, we see that


(7.3.18)  $\log \varphi(n, k) \sim -\frac{n^2 pq\, x_k^2}{2\, np\, nq} = -\frac{x_k^2}{2}.$


In view of (7.3.12) [why do we need this reminder?], this is equivalent to
$\varphi(n, k) \sim e^{-x_k^2/2}$.
Going back to (7.3.16) we obtain (7.3.13).
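Theorem 5 can be probed numerically by forming the ratio of the two sides of (7.3.13). The sketch below is an illustration under assumptions: the helper names, the choice $p = 0.3$ and the sample points $x_k \approx -2, \ldots, 2$ are not from the text, and the binomial term is computed through `math.lgamma` to avoid overflow for large $n$.

```python
import math

def log_binom_pmf(n, k, p):
    """log of C(n,k) p^k q^(n-k), using log-gamma to stay within float range."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def local_ratio(n, k, p):
    """Left member of (7.3.13) divided by its right member."""
    q = 1 - p
    x_k = (k - n * p) / math.sqrt(n * p * q)          # as in (7.3.11)
    approx = math.exp(-x_k * x_k / 2) / math.sqrt(2 * math.pi * n * p * q)
    return math.exp(log_binom_pmf(n, k, p)) / approx

p = 0.3
for n in (100, 10_000, 1_000_000):
    sd = math.sqrt(n * p * (1 - p))
    ks = [round(n * p + x * sd) for x in (-2, -1, 0, 1, 2)]
    print(n, [round(local_ratio(n, k, p), 4) for k in ks])
```

All ratios in a row should approach 1 together as $n$ increases, which is what uniformity over the range $|x_k| \le A$ means here.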


Theorem 6 (De Moivre-Laplace Theorem). For any two constants $a$ and $b$,
$-\infty < a < b < +\infty$, we have

(7.3.19)  $\lim_{n \to \infty} P\left(a < \frac{S_n - np}{\sqrt{npq}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx.$

Proof: Let $k$ denote a possible value of $S_n$ so that $S_n = k$ means
$(S_n - np)/\sqrt{npq} = x_k$ by the transformation (7.3.11). Hence the probability
on the left side of (7.3.19) is just


    $\sum_{a < x_k \le b} P(S_n = k) = \sum_{a < x_k \le b} \binom{n}{k} p^k q^{n-k}.$


Substituting for each term its asymptotic value given in (7.3.13), and
observing from (7.3.11) that


    $x_{k+1} - x_k = \frac{1}{\sqrt{npq}},$

we obtain


(7.3.20)  $\frac{1}{\sqrt{2\pi}} \sum_{a < x_k \le b} e^{-x_k^2/2}\, (x_{k+1} - x_k).$


The correspondence between $k$ and $x_k$ is one-to-one and when $k$ varies from
0 to $n$, $x_k$ varies in the interval $[-\sqrt{np/q}, \sqrt{nq/p}]$, not continuously but by
an increment $x_{k+1} - x_k = 1/\sqrt{npq}$. For large enough $n$ the interval contains
the given $(a, b]$ and the points $x_k$ falling inside $(a, b]$ form a partition of it into
equal subintervals of length $1/\sqrt{npq}$. Suppose the smallest and greatest values
of $k$ satisfying the condition $a < x_k \le b$ are $j$ and $l$, then we have


    $x_{j-1} \le a < x_j < x_{j+1} < \cdots < x_{l-1} < x_l \le b < x_{l+1}$


and the sum in (7.3.20) may be written as follows:


(7.3.21)  $\sum_{k=j}^{l} \varphi(x_k)(x_{k+1} - x_k)$; where $\varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$.


This is a Riemann sum for the definite integral $\int_a^b \varphi(x)\, dx$, although in
standard textbook treatments of Riemann integration the endpoints $a$ and $b$ are
usually included as points of partition. But this makes no difference as $n \to \infty$
and the partition becomes finer, so the sum above converges to the integral as
shown in (7.3.19).
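As a numerical companion to the proof, one can sum the exact binomial probabilities over $a < x_k \le b$ and compare with the integral in (7.3.19). The following sketch is only illustrative: the function names, the values $p = 0.3$, $a = -1$, $b = 2$ and the use of `math.erf` to evaluate the integral are assumptions made here, not part of the text.

```python
import math

def normal_integral(a, b):
    """(1/sqrt(2*pi)) * integral of exp(-x^2/2) over (a, b], via the error function."""
    return 0.5 * (math.erf(b / math.sqrt(2)) - math.erf(a / math.sqrt(2)))

def normalized_sum_prob(n, p, a, b):
    """P(a < (S_n - np)/sqrt(npq) <= b), adding exact binomial terms."""
    q = 1 - p
    sd = math.sqrt(n * p * q)
    total = 0.0
    for k in range(n + 1):
        x_k = (k - n * p) / sd
        if a < x_k <= b:
            log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                        - math.lgamma(n - k + 1)
                        + k * math.log(p) + (n - k) * math.log(q))
            total += math.exp(log_term)
    return total

a, b, p = -1.0, 2.0, 0.3
for n in (100, 1000, 10000):
    print(n, normalized_sum_prob(n, p, a, b), normal_integral(a, b))
```

The gap between the two columns should narrow as $n$ grows; the theorem itself says nothing about how fast.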

The result in (7.3.19) is called the De Moivre-Laplace Theorem [Abraham
De Moivre (1667-1754), considered as successor to Newton, gave this result
in his Doctrine of Chances (1714). Apparently he had priority over Stirling
(1692-1770) for the formula named after the latter. Laplace extended it and
realized its importance in his monumental Théorie Analytique des Probabilités
(1812)]. It was the first known particular case of The Central Limit Theorem
to be discussed in §7.5. It solves the problem of approximation
stated at the beginning of the section. The limit on the right of (7.3.19)
involves a new probability distribution to be discussed in the next section.
Simple examples of application will be given at the end of §7.5 and among
the exercises.


7.4. Normal distribution 


The probability distribution with the $\varphi$ in (7.3.21) as density function will
now be formally introduced:


    $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\, du, \quad \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$


It is called the normal distribution, also the Laplace-Gauss distribution; and
sometimes the prefix unit is attached to distinguish it from a whole family of
normal distributions derived by a linear transformation of the variable $x$; see
below. But we have yet to show that $\varphi$ is a true probability density as defined
in §4.5, namely that


(7.4.1)  $\int_{-\infty}^{\infty} \varphi(x)\, dx = 1.$


A heuristic proof of this fact may be obtained by setting $a = -\infty$, $b = +\infty$
in (7.3.19), whereupon the probability on the left side certainly becomes 1.
Why is this not rigorous? Because two (or three) passages to limit are involved 
here which are not necessarily interchangeable. Actually the argument can be 
justified (see Appendix 2); but it may be more important that you should 
convince yourself that a justification is needed at all. This is an instance 
where advanced mathematics separates from the elementary kind we are 
doing mostly in this book. 

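A direct proof of (7.4.1) is also very instructive; although it is given in
most calculus texts we will reproduce it for its sheer ingenuity. The trick is
to consider the square of the integral in (7.4.1) and convert it to a double
integral:


    $\left(\int_{-\infty}^{\infty} \varphi(x)\, dx\right) \left(\int_{-\infty}^{\infty} \varphi(y)\, dy\right) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \varphi(x)\varphi(y)\, dx\, dy$

    $= \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left(-\frac{1}{2}(x^2 + y^2)\right) dx\, dy.$


We can then use polar coordinates:


    $\rho^2 = x^2 + y^2, \quad dx\, dy = \rho\, d\rho\, d\theta$

to evaluate it:

    $\frac{1}{2\pi} \int_0^{2\pi} \int_0^{\infty} \exp\left(-\frac{\rho^2}{2}\right) \rho\, d\rho\, d\theta = \frac{1}{2\pi} \int_0^{2\pi} \left[-e^{-\rho^2/2}\right]_0^{\infty} d\theta = \frac{1}{2\pi} \int_0^{2\pi} 1\, d\theta = 1.$


This establishes (7.4.1) if we take the positive square root.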
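A crude numerical integration gives an independent check of (7.4.1) and of the radial integral that appears after the change to polar coordinates. The sketch below assumes a plain trapezoidal rule on a truncated range; the cut-off points and step counts are arbitrary choices made for illustration.

```python
import math

def phi(x):
    """Unit normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def trapezoid(f, a, b, steps):
    """Composite trapezoidal rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Integral of phi over the whole line, truncated to [-10, 10].
print(trapezoid(phi, -10.0, 10.0, 20000))

# Radial integral of rho * exp(-rho^2 / 2) over [0, infinity), truncated at 40.
print(trapezoid(lambda r: r * math.exp(-r * r / 2), 0.0, 40.0, 20000))
```

Both printed values should be very close to 1; the truncation is harmless because the integrands are negligibly small beyond the cut-offs.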

The normal density $\varphi$ has many remarkable analytical properties, in fact
Gauss determined it by selecting a few of them as characteristics of a “law
of errors.” [Carl Friedrich Gauss (1777-1855) ranked as one of the greatest
of all mathematicians, also did fundamental work in physics, astronomy and
geodesy. His major contribution to probability was through his theory of
errors of observations, known as the method of least squares.] Let us observe
first that it is a symmetric function of $x$, namely $\varphi(x) = \varphi(-x)$, from which
the convenient formula follows:


(7.4.2)  $\int_{-x}^{x} \varphi(u)\, du = \Phi(x) - \Phi(-x) = 2\Phi(x) - 1.$


Next, $\varphi$ has derivatives of all orders, and each derivative is the product of $\varphi$
by a polynomial called a Hermite polynomial. The existence of all derivatives
makes the curve $x \to \varphi(x)$ very smooth, and it is usually described as
“bell-shaped” [see the graph attached to the Table of $\Phi(x)$ on pp. 320-321].
Furthermore as $|x| \to \infty$, $\varphi(x)$ decreases to 0 very rapidly. The
following estimate of the tail of $\Phi$ is often useful:


    $1 - \Phi(x) = \int_x^{\infty} \varphi(u)\, du \le \frac{\varphi(x)}{x}, \quad x > 0.$

To see this, note that $-\varphi'(u) = u\varphi(u)$, hence


    $\int_x^{\infty} 1 \cdot \varphi(u)\, du \le \int_x^{\infty} \frac{u}{x}\, \varphi(u)\, du = -\frac{1}{x} \int_x^{\infty} \varphi'(u)\, du = -\frac{1}{x}\, \varphi(u) \Big|_x^{\infty} = \frac{\varphi(x)}{x},$


another neat trick. It follows that $\varphi$ not only has moments of all orders, but
even the integral


(7.4.3)  $M(\theta) = \int_{-\infty}^{\infty} e^{\theta x} \varphi(x)\, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\left(\theta x - \frac{x^2}{2}\right) dx$


is finite for every real $\theta$, because $e^{-x^2/2}$ decreases much faster than $e^{|\theta x|}$
increases as $|x| \to \infty$. As a function of $\theta$, $M$ is called the moment generating
function of $\varphi$ or $\Phi$. Note that if we replace $\theta$ by the purely imaginary $i\theta$, then
$M(i\theta)$ becomes the characteristic function or Fourier transform of $\Phi$ [see
(6.5.14)]. The reason why we did not introduce the moment generating
function in §6.5 is because the integral in (7.4.3) rarely exists if $\varphi$ is replaced by
an arbitrary density function, but for the normal $\varphi$, $M(\theta)$ is cleaner than $M(i\theta)$
and serves as well. Let us now calculate $M(\theta)$. This is done by completing a
square in the exponent in (7.4.3):


    $\theta x - \frac{x^2}{2} = \frac{\theta^2}{2} - \frac{(x - \theta)^2}{2}.$


Now we have

(7.4.4)  $M(\theta) = e^{\theta^2/2} \int_{-\infty}^{\infty} \varphi(x - \theta)\, dx = e^{\theta^2/2}.$
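Two of the analytic facts above lend themselves to a quick numerical check: the tail estimate $1 - \Phi(x) \le \varphi(x)/x$ and the closed form (7.4.4). The sketch below is illustrative only; it assumes a trapezoidal rule for the integral in (7.4.3), uses `math.erfc` for the tail of $\Phi$, and the sample values of $x$ and $\theta$ are arbitrary.

```python
import math

def phi(x):
    """Unit normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def tail(x):
    """1 - Phi(x), expressed through the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def mgf_numeric(theta, half_range=12.0, steps=40000):
    """Trapezoidal approximation of the integral in (7.4.3).
    The integrand peaks at x = theta, so the range is centered there."""
    a, b = theta - half_range, theta + half_range
    h = (b - a) / steps
    f = lambda x: math.exp(theta * x) * phi(x)
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h

for x in (0.5, 1.0, 2.0, 4.0):
    print(x, tail(x), phi(x) / x)            # first number should not exceed the second

for theta in (-1.0, 0.5, 1.3):
    print(theta, mgf_numeric(theta), math.exp(theta * theta / 2))
```

The tail bound is loose for small $x$ and becomes sharp as $x$ grows, which is where it is most useful.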

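From this we can derive all the moments of $\Phi$ by successive differentiation
of $M$ with respect to $\theta$, as in the case of a generating function discussed in
§6.5. More directly, we may expand the $e^{\theta x}$ in (7.4.3) into its Taylor series in
$\theta$ and compare the result with the Taylor series of $e^{\theta^2/2}$ in (7.4.4):


    $\int_{-\infty}^{\infty} \left\{1 + \theta x + \frac{(\theta x)^2}{2!} + \cdots + \frac{(\theta x)^n}{n!} + \cdots\right\} \varphi(x)\, dx$

    $= 1 + \frac{\theta^2}{2} + \frac{1}{2!} \left(\frac{\theta^2}{2}\right)^2 + \cdots + \frac{1}{n!} \left(\frac{\theta^2}{2}\right)^n + \cdots.$


If we denote the $n$th moment by $m^{(n)}$:

    $m^{(n)} = \int_{-\infty}^{\infty} x^n \varphi(x)\, dx,$

the above equation becomes


    $\sum_{n=0}^{\infty} \frac{m^{(n)}}{n!}\, \theta^n = \sum_{n=0}^{\infty} \frac{\theta^{2n}}{2^n n!}.$


It follows from the uniqueness of power series expansion (cf. §6.5) that the
corresponding coefficients on both sides must be equal: thus for $n \ge 1$:


(7.4.5)  $m^{(2n-1)} = 0, \quad m^{(2n)} = \frac{(2n)!}{2^n n!}.$


Of course the vanishing of all moments of odd order is an immediate
consequence of the symmetry of $\varphi$.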
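The moment formula (7.4.5) gives the values 1, 3, 15, 105 for the moments of order 2, 4, 6, 8. A sketch of a numerical confirmation follows; the trapezoidal rule, the truncation to $[-12, 12]$ and the function names are assumptions made for illustration.

```python
import math

def phi(x):
    """Unit normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def moment_numeric(order, half_range=12.0, steps=24000):
    """Trapezoidal approximation of the integral of x^order * phi(x)."""
    h = 2 * half_range / steps
    f = lambda x: x**order * phi(x)
    total = 0.5 * (f(-half_range) + f(half_range))
    total += sum(f(-half_range + i * h) for i in range(1, steps))
    return total * h

def moment_formula(order):
    """Moment of the unit normal according to (7.4.5)."""
    if order % 2 == 1:
        return 0
    n = order // 2
    return math.factorial(2 * n) // (2**n * math.factorial(n))

for order in range(1, 9):
    print(order, round(moment_numeric(order), 6), moment_formula(order))
```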

In general, for any real $m$ and $\sigma^2$, a random variable $X$ is said to have a
normal distribution $N(m, \sigma^2)$ iff the reduced variable

    $X^* = \frac{X - m}{\sigma}$

has $\Phi$ as its distribution function. In particular for $m = 0$ and $\sigma^2 = 1$, $N(0, 1)$
is just the unit normal $\Phi$. The density function of $N(m, \sigma^2)$ is


(7.4.6)  $\frac{1}{\sqrt{2\pi}\, \sigma} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right) = \frac{1}{\sigma}\, \varphi\left(\frac{x - m}{\sigma}\right).$


This follows from a general proposition (see Exercise 27 of Chapter 6). The
moment-generating function $M_X$ of $X$ is most conveniently calculated through
that of $X^*$ as follows:


(7.4.7)  $M_X(\theta) = E(e^{\theta(m + \sigma X^*)}) = e^{m\theta} E(e^{(\sigma\theta) X^*}) = e^{m\theta} M(\sigma\theta) = e^{m\theta + \sigma^2\theta^2/2}.$


A basic property of the normal family is given below. Cf. the analogous 
Theorem 3 in §7.2 for the Poisson family. 


Theorem 7. Let $X_j$ be independent random variables with normal distributions
$N(m_j, \sigma_j^2)$, $1 \le j \le n$. Then $X_1 + \cdots + X_n$ has the normal distribution
$N\left(\sum_{j=1}^{n} m_j, \sum_{j=1}^{n} \sigma_j^2\right)$.


Proof: It is sufficient to prove this for $n = 2$, since the general case follows
by induction. This is easily done by means of the moment-generating function.
We have by the product theorem as in Theorem 4 of §6.5:


    $M_{X_1 + X_2}(\theta) = M_{X_1}(\theta) M_{X_2}(\theta) = e^{m_1\theta + \sigma_1^2\theta^2/2}\, e^{m_2\theta + \sigma_2^2\theta^2/2}$

    $= e^{(m_1 + m_2)\theta + (\sigma_1^2 + \sigma_2^2)\theta^2/2}$


which is the moment-generating function of $N(m_1 + m_2, \sigma_1^2 + \sigma_2^2)$ by (7.4.7).
Hence $X_1 + X_2$ has this normal distribution since it is uniquely determined
by the moment-generating function. [We did not prove this assertion but
see §6.5.]


7.5.* Central limit theorem 


We will now return to the De Moivre-Laplace Theorem 6 and give it a more
general formulation. Recall that


(7.5.1)  $S_n = X_1 + \cdots + X_n, \quad n \ge 1,$


where the $X_j$’s are independent Bernoullian random variables. We know that
for every $j$:

    $E(X_j) = p, \quad \sigma^2(X_j) = pq;$

and for every $n$:

    $E(S_n) = np, \quad \sigma^2(S_n) = npq;$

see Example 6 of §6.3. Put


(7.5.2)  $X_j^* = \frac{X_j - E(X_j)}{\sigma(X_j)}, \quad S_n^* = \frac{S_n - E(S_n)}{\sigma(S_n)} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} X_j^*.$


The $S_n^*$’s are the random variables appearing in the left member of (7.3.19),
and are sometimes called the normalized or normed sums. We have for every
$j$ and $n$:


(7.5.3)  $E(X_j^*) = 0, \quad \sigma^2(X_j^*) = 1;$
         $E(S_n^*) = 0, \quad \sigma^2(S_n^*) = 1.$

The linear transformation from $X_j$ to $X_j^*$ or $S_n$ to $S_n^*$ amounts to a change
of origin and scale in the measurement of a random quantity in order to
reduce its mean to zero and variance to one as shown in (7.5.3). Each $S_n^*$ is a
random variable taking the set of values

    $x_{n,k} = \frac{k - np}{\sqrt{npq}}, \quad k = 0, 1, \ldots, n.$

This is just the $x_{nk}$ in (7.3.11) with the explicit dependence on $n$ indicated.
The probability distribution of $S_n^*$ is given by


    $P(S_n^* = x_{n,k}) = \binom{n}{k} p^k q^{n-k}, \quad 0 \le k \le n.$


It is more convenient to use the corresponding distribution function; call it
$F_n$, so that


    $P(S_n^* \le x) = F_n(x), \quad -\infty < x < \infty.$

Finally, if $I$ is the finite interval $(a, b]$, and $F$ is any distribution function, we
shall write

    $F(I) = F(b) - F(a).$
[By now you should understand why we used $(a, b]$ rather than $(a, b)$ or $[a, b]$.
It makes no difference if $F$ is continuous, but the $F_n$’s above are not continuous.
Of course in the limit the difference disappears in the present case, but it
cannot be ignored generally.] After these elaborate preparations, we can
rewrite the De Moivre-Laplace formula in the elegant form below: for any
finite interval $I$,


(7.5.4)  $\lim_{n \to \infty} F_n(I) = \Phi(I).$


Thus we see that we are dealing with the convergence of a sequence of dis- 
tribution functions to a given distribution function in a certain sense. 

In this formulation the subject is capable of a tremendous generalization. 
The sequence of distribution functions need not be those of normalized sums, 
the given limit need not be the normal distribution nor even specified in 
advance, and the sense of convergence need not be that specified above. For 
example, the Poisson limit theorem discussed in §7.1 can be viewed as a 
particular instance. The subject matter has been intensively studied in the 
last forty years and is still undergoing further evolutions. [For some reference
books in English see [Feller 2], [Chung 1].] Here we must limit ourselves to
one such generalization, the so-called Central Limit Theorem in its classical
setting, which is about the simplest kind of extension of Theorem 6 in §7.3.
Even so we shall need a powerful tool from more advanced theory which 
we can use but not fully explain. This extension consists in replacing the 
Bernoullian variables above by rather arbitrary ones, as we proceed to 
describe. 

Let $\{X_j, j \ge 1\}$ be a sequence of independent and identically distributed
random variables. The phrase “identically distributed” means they have a
common distribution, which need not be specified. But it is assumed that the
mean and variance of each $X_j$ are finite and denoted by $m$ and $\sigma^2$ respectively,
where $0 < \sigma^2 < \infty$. Define $S_n$ and $S_n^*$ exactly as before, then


(7.5.5)  $E(S_n) = nm, \quad \sigma^2(S_n) = n\sigma^2$


and (7.5.3) holds as before. Again let $F_n$ denote the distribution of the
normalized sum $S_n^*$. Then Theorem 8 below asserts that (7.5.4) remains true under
the liberalized conditions for the $X_j$’s. To mention just some simple cases,
each $X_j$ may now be a “die-rolling” instead of a “coin-tossing” random
variable to which Theorem 6 is applicable; or it may be uniformly distributed
(“point-picking” variable); or again it may be exponentially distributed
(“telephone-ringing” variable). Think of some other varieties if you wish.


Theorem 8. For the sums $S_n$ under the generalized conditions spelled out
above, we have for any $a < b$:


(7.5.6)  $\lim_{n \to \infty} P\left(a < \frac{S_n - nm}{\sigma\sqrt{n}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\, dx.$


Proof: The powerful tool alluded to earlier is that of the characteristic
function discussed in §6.5. [We could not have used the moment-generating
function since it may not exist for $S_n$.] For the unit normal distribution $\Phi$, its
characteristic function $g$ can be obtained by substituting $i\theta$ for $\theta$ in (7.4.4):


(7.5.7)  $g(\theta) = e^{-\theta^2/2}.$


With each arbitrary distribution $F_n$ there is also associated its characteristic
function $g_n$ which is in general expressible by means of $F_n$ as a Stieltjes
integral. This is beyond the scope of this book but luckily we can by-pass it
in the following treatment by using the associated random variables.
[Evidently we will leave the reader to find out what may be concealed!] We can
now state the following result.


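Theorem 9. If we have for every $\theta$,

(7.5.8)  $\lim_{n \to \infty} g_n(\theta) = g(\theta) = e^{-\theta^2/2},$


then we have for every $x$:


(7.5.9)  $\lim_{n \to \infty} F_n(x) = \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-u^2/2}\, du;$


in particular (7.5.4) is true.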

Although we shall not prove this (see [Chung 1; Chapter 6]), let us at
least probe its significance. According to Theorem 7 of §6.5, each $g_n$ uniquely
determines $F_n$, and $g$ determines $\Phi$. The present theorem carries this
correspondence between distribution function and its transform (characteristic
function) one step further; for it says that the limit of the sequence $\{g_n\}$ also
determines the limit of the sequence $\{F_n\}$. Hence it has been called the
“continuity theorem” for the transform. In the case of the normal $\Phi$ above
the result is due to Pólya; the general case is due to Paul Lévy (1886-1972)
and Harald Cramér (1893- ), both pioneers of modern probability theory.

Next we need a little lemma about characteristic functions. 


Lemma. If $X$ has mean 0 and variance 1, then its characteristic function $h$ has
the following Taylor expansion at $\theta = 0$:


(7.5.10)  $h(\theta) = 1 - \frac{\theta^2}{2}(1 + \epsilon(\theta))$


where $\epsilon$ is a function depending on $h$ such that $\lim_{\theta \to 0} \epsilon(\theta) = 0$.


Proof: According to a useful form of Taylor’s theorem (look it up in your
calculus book): if $h$ has a second derivative at $\theta = 0$, then we have


(7.5.11)  $h(\theta) = h(0) + h'(0)\theta + \frac{h''(0)}{2}\, \theta^2 (1 + \epsilon(\theta)).$

From

    $h(\theta) = E(e^{i\theta X})$


we obtain by formal differentiation:


    $h'(\theta) = E(e^{i\theta X}\, iX), \quad h''(\theta) = E(e^{i\theta X} (iX)^2);$

hence

    $h'(0) = E(iX) = 0, \quad h''(0) = E(-X^2) = -1.$

Substituting into (7.5.11) we get (7.5.10).
Theorem 8 can now be proved by a routine calculation. Consider the
characteristic function of $S_n^*$:


    $E(e^{i\theta S_n^*}) = E(e^{i\theta(X_1^* + \cdots + X_n^*)/\sqrt{n}}).$


Since the $X_j^*$’s are independent and identically distributed as well as the $X_j$’s,
by the analogue of Theorem 6 of §6.5 the right member above is equal to


(7.5.12)  $E(e^{i\theta X_1^*/\sqrt{n}})^n = h\left(\frac{\theta}{\sqrt{n}}\right)^n$


where $h$ denotes the characteristic function of $X_1^*$. It follows from the Lemma
that


(7.5.13)  $h\left(\frac{\theta}{\sqrt{n}}\right) = 1 - \frac{\theta^2}{2n} \left(1 + \epsilon\left(\frac{\theta}{\sqrt{n}}\right)\right)$


where $\theta$ is fixed and $n \to \infty$. Consequently we have


    $\lim_{n \to \infty} E(e^{i\theta S_n^*}) = \lim_{n \to \infty} \left[1 - \frac{\theta^2}{2n} \left(1 + \epsilon\left(\frac{\theta}{\sqrt{n}}\right)\right)\right]^n = e^{-\theta^2/2}$


by an application of (7.1.12). This means the characteristic functions of $S_n^*$
converge to that of the unit normal, therefore by Theorem 9, the distribution
$F_n$ converges to $\Phi$ in the sense of (7.5.9), from which (7.5.6) follows.
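The limit computed in the proof can be watched numerically in a case where $h$ is known in closed form. The sketch below takes the $X_j$ to be Bernoullian with $p = 0.3$, an assumption chosen here for illustration, so that $X_1^*$ takes the value $q/\sqrt{pq}$ with probability $p$ and $-p/\sqrt{pq}$ with probability $q$; the function names are hypothetical.

```python
import cmath
import math

def h_reduced_bernoulli(theta, p):
    """Characteristic function of X* = (X - p)/sqrt(pq) for a Bernoullian X."""
    q = 1 - p
    s = math.sqrt(p * q)
    return p * cmath.exp(1j * theta * q / s) + q * cmath.exp(-1j * theta * p / s)

def char_fn_normalized_sum(theta, n, p):
    """E(exp(i*theta*S_n*)) = h(theta/sqrt(n))^n, as in (7.5.12)."""
    return h_reduced_bernoulli(theta / math.sqrt(n), p) ** n

theta, p = 1.5, 0.3
limit = math.exp(-theta * theta / 2)
for n in (10, 1000, 100_000, 10_000_000):
    value = char_fn_normalized_sum(theta, n, p)
    print(n, value, abs(value - limit))
```

The last column should decrease toward 0, in agreement with (7.5.8); the imaginary part is nonzero for finite $n$ because this distribution is not symmetric.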

The name “central limit theorem” is used generally to designate a con- 
vergence theorem in which the normal distribution appears as the limit. More 
particularly it applies to sums of random variables as in Theorem 8. Histori- 
cally these variables arose as errors of observations of chance fluctuations, so 
that the result is the all-embracing assertion that under “normal’’ conditions 
they all obey the same normal law, also known as the “error function.” For 
this reason it had been regarded by some as a law of nature! Even in this
narrow context Theorem 8 can be generalized in several directions: the
assumptions of a finite second moment, of a common distribution, and of 
strict independence can all be relaxed. Finally, if the normal conditions are 
radically altered, then the central limit theorem will no longer apply, and 
random phenomena abound in which the limit distribution is no longer 
normal. The Poisson case discussed in §7.1 may be considered as one such 
example but there are other laws closely related to the normal which are 
called “‘stable” and “‘infinitely divisible’ laws. See [Chung 1; Chapter 7] for a 
discussion of the various possibilities mentioned here. 

It should be stressed that the central limit theorem as stated in Theorems
6 and 8 is of the form (7.5.4), without giving an estimate of the “error”
$F_n(I) - \Phi(I)$. In other words, it asserts convergence without indicating any
“speed of convergence.” This renders the result useless in accurate numerical
computations. However, under specified conditions it is possible to obtain
bounds for the error. For example in the De Moivre-Laplace case (7.3.19)
we can show that the error does not exceed $C/\sqrt{n}$ where $C$ is a numerical
constant involving $p$ but not $a$ or $b$; see [Chung 1; §7.4] for a more general
result. In crude, quick-and-dirty, applications the error is simply ignored, as
will be done below.

In contrast to the mathematical developments, simple practical applica- 
tions which form the backbone of “large sample theory”’ in statistics are 
usually of the cook-book variety. The great limit theorem embodied in (7.5.6) 
is turned into a rough approximate formula which may be written as follows: 


    $P(x_1 \sigma\sqrt{n} < S_n - mn \le x_2 \sigma\sqrt{n}) \approx \Phi(x_2) - \Phi(x_1).$


In many situations we are interested in a symmetric spread around the mean,
i.e., $x_1 = -x_2$, then the above becomes by (7.4.2):


(7.5.14)  $P(|S_n - mn| < x\sigma\sqrt{n}) \approx 2\Phi(x) - 1.$


Extensive tabulations of the values of ® and its inverse function ®~! are 
available; a short table is appended at the end of the book. The following 
example illustrates the routine applications of the central limit theorem. 


Example 7. A physical quantity is measured many times for accuracy. Each
measurement is subject to a random error. It is judged reasonable to assume
that it is uniformly distributed between $-1$ and $+1$ in a conveniently chosen
unit. Now if we take the arithmetical mean [average] of $n$ measurements,
what is the probability that it differs from the true value by less than a
fraction $\delta$ of the unit?

Let the true value be denoted by $m$ and the actual measurements obtained
by $X_j$, $1 \le j \le n$. Then the hypothesis says that


    $X_j = m + \xi_j,$

where $\xi_j$ is a random variable which has the uniform distribution in $[-1, +1]$.
Thus


    $E(\xi_j) = \int_{-1}^{+1} \frac{1}{2}\, x\, dx = 0, \quad \sigma^2(\xi_j) = E(\xi_j^2) = \int_{-1}^{+1} \frac{1}{2}\, x^2\, dx = \frac{1}{3};$

    $E(X_j) = m, \quad \sigma^2(X_j) = \frac{1}{3}.$

In our notation above, we want to compute the approximate value of
$P\{|S_n - mn| < \delta n\}$. This probability must be put into the form given in
(7.5.6), and the limit relation there becomes by (7.5.14):


    $P\left\{\left|\frac{S_n - mn}{\sqrt{n/3}}\right| < \delta\sqrt{3n}\right\} \approx 2\Phi(\delta\sqrt{3n}) - 1.$


For instance, if $n = 25$ and $\delta = 1/5$, then the result is equal to


    $2\Phi(\sqrt{3}) - 1 \approx 2\Phi(1.73) - 1 \approx .92,$


from the Table on p. 321. Thus, if 25 measurements are taken, then we are
92% sure that their average is within one fifth of a unit from the true value.
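The table lookup in Example 7 can be reproduced with a few lines of code. This is a minimal sketch, assuming Python's `statistics.NormalDist` as a stand-in for the printed table of $\Phi$; the function name is hypothetical.

```python
import math
from statistics import NormalDist

def prob_mean_within(n, delta):
    """Normal approximation 2*Phi(delta*sqrt(3n)) - 1 for errors uniform on [-1, 1]."""
    x = delta * math.sqrt(3 * n)        # delta*n divided by sigma*sqrt(n), sigma^2 = 1/3
    return 2 * NormalDist().cdf(x) - 1

print(prob_mean_within(25, 1 / 5))      # the case n = 25, delta = 1/5 of the example
```

Like the example itself, this ignores the error of the normal approximation; it only replaces the table.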

Often the question is turned around: how many measurements should
we take in order that the probability will exceed $\alpha$ (the “significance level”)
that the average will differ from the true value by at most $\delta$? This means we
must first find the value $x_\alpha$ such that


    $2\Phi(x_\alpha) - 1 = \alpha$, or $\Phi(x_\alpha) = \frac{1 + \alpha}{2};$

and then choose $n$ to make

    $\delta\sqrt{3n} > x_\alpha.$


For instance, if $\alpha = .95$ and $\delta = 1/5$, then the Table shows that $x_\alpha \approx 1.96$;
hence

    $n > \frac{x_\alpha^2}{3\delta^2} \approx 32.$


Thus, seven or eight more measurements should increase our degree of
confidence from 92% to 95%. Whether this is worthwhile may depend on the cost
of doing the additional work as well as the significance of the enhanced
probability.
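The inverse question can be automated in the same spirit. The sketch below assumes `statistics.NormalDist().inv_cdf` in place of the table of $\Phi^{-1}$ and returns the smallest integer $n$ satisfying $\delta\sqrt{3n} > x_\alpha$; the function name and return convention are choices made here for illustration.

```python
import math
from statistics import NormalDist

def required_measurements(alpha, delta):
    """Return (x_alpha, n): Phi(x_alpha) = (1 + alpha)/2 and n is the smallest
    integer with delta*sqrt(3n) > x_alpha, for errors uniform on [-1, 1]."""
    x_alpha = NormalDist().inv_cdf((1 + alpha) / 2)
    n = math.floor(x_alpha**2 / (3 * delta**2)) + 1
    return x_alpha, n

x_alpha, n = required_measurements(0.95, 1 / 5)
print(x_alpha, n)
```

Fixing any two of $\delta$, $\alpha$ and $n$ and solving for the third follows the same pattern.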

It is clear that there are three variables involved in questions of this kind,
namely: $\delta$, $\alpha$ and $n$. If two of them are fixed, we can solve for the third. Thus if
$n = 25$ is fixed because the measurements are found in recorded data and not
repeatable, and our credulity demands a high degree of confidence $\alpha$, say
99%, then we must compromise on the coefficient of accuracy $\delta$. We leave
this as an exercise.

Admittedly these practical applications of the great theorem are dull stuff,
but so are e.g. Newton’s laws of motion on the quotidian level.


7.6. Law of large numbers 


In this section we collect two results related to the central limit theorem: the 
law of large numbers and Chebyshev’s inequality. 

The celebrated Law of Large Numbers can be deduced from Theorem 8 
as an easy consequence. 


Theorem 10. Under the same conditions as in Theorem 8, we have for a fixed
but arbitrary constant $c > 0$:


(7.6.1)  $\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - m\right| < c\right) = 1.$

Proof: Since $c$ is fixed, for any positive constant $l$, we have


(7.6.2)  $l\sigma\sqrt{n} < cn$


for all sufficiently large values of $n$. Hence the event


    $\left\{\left|\frac{S_n - mn}{\sigma\sqrt{n}}\right| < l\right\}$ certainly implies $\left\{\left|\frac{S_n - mn}{n}\right| < c\right\}$

and so

(7.6.3)  $P\left(\left|\frac{S_n - mn}{\sigma\sqrt{n}}\right| < l\right) \le P\left(\left|\frac{S_n - mn}{n}\right| < c\right)$


for large $n$. According to (7.5.6) with $a = -l$, $b = +l$, the left member above
converges to


    $\frac{1}{\sqrt{2\pi}} \int_{-l}^{l} e^{-x^2/2}\, dx$


as $n \to \infty$. Given any $\delta > 0$ we can first choose $l$ so large that the value of the
integral above exceeds $1 - \delta$, then choose $n$ so large that (7.6.3) holds and its
left member also exceeds $1 - \delta$. It follows that


(7.6.4)  $P\left(\left|\frac{S_n}{n} - m\right| < c\right) > 1 - \delta$


for all sufficiently large $n$, and this is what (7.6.1) says.
Briefly stated, the law of large numbers is a corollary to the central limit
theorem because any large multiple of $\sqrt{n}$ is negligible in comparison with
any small multiple of $n$.

In the Bernoullian case the result was first proved by Jakob Bernoulli as 
a crowning achievement. [Jakob or Jacques Bernoulli (1654-1705), Swiss 
mathematician and physicist, author of the first treatise on probability: Ars 
conjectandi (1713) which contained this theorem.] His proof depended on 
direct calculations with binomial coefficients without of course the benefit of 
such formulas as Stirling’s. In a sense the DeMoivre-Laplace Theorem 6 was 
a sequel to it. By presenting it in reverse to the historical development it is 
made to look like a trivial corollary. As a matter of fact, the law of large 
numbers is a more fundamental but also more primitive limit theorem. It 
holds true under much broader conditions than the central limit theorem. 
For instance, in the setting of Theorem 8, it is sufficient to assume that the
common mean of $X_j$ is finite, without any assumption on the second moment.
Since the assertion of the law concerns only the mean such an extension is 
significant and was first proved by A. Ya. Khintchine [1894—1959, one of the 
most important of the school of Russian probabilists]. In fact, it can be 
proved by the method used in the proof of Theorem 8 above, except that it 
requires an essential extension of Theorem 9 which will take us out of our 
depth here. (See Theorem 6.4.3 of [Chung 1].) Instead we will give an extension
of Theorem 9 in another direction, when the random variables $\{X_j\}$
are not necessarily identically distributed. This is easy via another celebrated
but simple result known as Chebyshev's inequality. [P. L. Chebyshev (1821-1894),
together with A. A. Markov (1856-1922) and A. M. Ljapunov (1857-1918),
were founders of the Russian school of probability.]


Theorem 11. Suppose the random variable $X$ has a finite second moment; then
for any constant $c > 0$ we have

(7.6.5)  $P(|X| \ge c) \le \frac{E(X^2)}{c^2}.$

Proof: We will carry out the proof for a countably valued X and leave the 
analogous proof for the density case as an exercise. The idea of the proof is 
the same for a general random variable. 

Suppose that $X$ takes the values $v_j$ with probabilities $p_j$, as in §4.3. Then
we have

(7.6.6)  $E(X^2) = \sum_j p_j v_j^2.$

If we consider only those values $v_j$ satisfying the inequality $|v_j| \ge c$ and
denote by $A$ the corresponding set of indices $j$, namely $A = \{j : |v_j| \ge c\}$,
then of course $v_j^2 \ge c^2$ for $j \in A$, whereas

$P(|X| \ge c) = \sum_{j \in A} p_j.$

Hence if we sum the index $j$ only over the partial set $A$, we have

$E(X^2) \ge \sum_{j \in A} p_j v_j^2 \ge \sum_{j \in A} p_j c^2 = c^2 \sum_{j \in A} p_j = c^2 P(|X| \ge c),$

which is (7.6.5).
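
The two sides of (7.6.5) can be compared on a small example. This is only an illustrative sketch: the five-point distribution and the helper name `chebyshev_check` are assumptions made here, not taken from the text, and exact fractions are used so that no rounding enters.

```python
from fractions import Fraction


def chebyshev_check(values, probs, c):
    """Return (P(|X| >= c), E(X^2) / c^2) for a finitely valued X."""
    second_moment = sum(p * v * v for v, p in zip(values, probs))
    # Sum p_j only over the index set A = {j : |v_j| >= c}, as in the proof.
    tail = sum(p for v, p in zip(values, probs) if abs(v) >= c)
    return tail, second_moment / (c * c)


# Hypothetical distribution chosen for illustration only.
values = [-3, -1, 0, 1, 3]
probs = [Fraction(1, 10), Fraction(1, 5), Fraction(2, 5), Fraction(1, 5), Fraction(1, 10)]

tail, bound = chebyshev_check(values, probs, c=2)
print(tail, bound)
```

The gap between the two numbers reflects how much is discarded when the sum in (7.6.6) is restricted to $A$ and each $v_j^2$ is replaced by $c^2$.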
We can now state an extended form of the law of large numbers as follows.


Theorem 12. Let $\{X_j, j \ge 1\}$ be a sequence of independent random variables
such that for each $j$:

(7.6.7)  $E(X_j) = m_j, \quad \sigma^2(X_j) = \sigma_j^2;$

and furthermore suppose there exists a constant $M < \infty$ such that for all $j$:

(7.6.8)  $\sigma_j^2 \le M.$

Then we have for each fixed $c > 0$:

(7.6.9)  $\lim_{n\to\infty} P\left(\left|\frac{X_1 + \cdots + X_n}{n} - \frac{m_1 + \cdots + m_n}{n}\right| < c\right) = 1.$



Proof: If we write $X_j^0 = X_j - m_j$, $S_n^0 = \sum_{j=1}^{n} X_j^0$, then the expression between
the bars above is just $S_n^0/n$. Of course $E(S_n^0) = 0$, whereas

$E((S_n^0)^2) = \sigma^2(S_n^0) = \sum_{j=1}^{n} \sigma^2(X_j^0) = \sum_{j=1}^{n} \sigma_j^2.$

This string of equalities follows easily from the properties of variances and
you ought to have no trouble recognizing them now at a glance. [If you still
do then you should look up the places in preceding chapters where they are
discussed.] Now the condition in (7.6.8) implies that

(7.6.10)  $E((S_n^0)^2) \le Mn, \quad E\left(\left(\frac{S_n^0}{n}\right)^2\right) \le \frac{M}{n}.$

It remains to apply Theorem 11 to $X = S_n^0/n$ to obtain

(7.6.11)  $P\left(\left|\frac{S_n^0}{n}\right| \ge c\right) \le \frac{E((S_n^0/n)^2)}{c^2} \le \frac{M}{c^2 n}.$

Hence the probability above converges to zero as $n \to \infty$, which is equivalent
to the assertion in (7.6.9).

Actually the proof yields more: it gives an estimate on the "speed of
convergence." Namely, given $M$, $c$ and $\delta$ we can tell how large $n$ must be in
order that the probability in (7.6.9) exceeds $1 - \delta$. Note also that Theorem 10
is a particular case of Theorem 12 because there all the $\sigma_j^2$'s are equal and we
may take $M = \sigma_1^2$.
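
The "speed of convergence" remark can be made concrete. The sketch below solves $M/(c^2 n) \le \delta$ from (7.6.11) for $n$, which guarantees that the probability in (7.6.9) is at least $1 - \delta$; the helper name `chebyshev_sample_size` and the sample inputs are illustrative assumptions, and exact fractions are used to avoid rounding at the boundary.

```python
import math
from fractions import Fraction


def chebyshev_sample_size(M, c, delta):
    """Smallest integer n with M / (c**2 * n) <= delta, from the bound (7.6.11)."""
    M, c, delta = Fraction(M), Fraction(c), Fraction(delta)
    return math.ceil(M / (c * c * delta))


# Illustrative inputs: variance bound 1/4, tolerance c = 1/10, delta = 1/20.
n = chebyshev_sample_size(Fraction(1, 4), Fraction(1, 10), Fraction(1, 20))
print(n)
```

Since Chebyshev's inequality is crude, the value returned is sufficient but usually far from necessary.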

Perhaps the reader will agree that the above derivations of Theorems 11 
and 12 are relatively simple doings compared with the fireworks in §§7.3-7.4. 
Looking back, we may find it surprising that it took two centuries before the 
right proof of Bernoulli’s theorem was discovered by Chebyshev. It is an 
instance of the triumph of an idea, a new way of thinking, but even Cheby- 
shev himself buried his inequality among laborious and unnecessary details. 
The cleaning up as shown above was done by later authors. Let us observe 
that the method of proof is applicable to any sum S, of random variables, 
whether they are independent or not, provided that the crucial estimates in 
(7.6.10) are valid. 

We turn now to the meaning of the law of large numbers. This is best
explained in the simplest Bernoullian scheme where each $X_j$ takes the values
1 and 0 with probabilities $p$ and $q = 1 - p$, as in Theorems 5 and 6 above.
In this case $S_n^0 = S_n - np$ and $E((S_n^0)^2) = \sigma^2(S_n) = npq$, so that (7.6.11)
becomes

(7.6.12)  $P\left(\left|\frac{S_n}{n} - p\right| \ge c\right) \le \frac{pq}{c^2 n} \le \frac{1}{4c^2 n};$

note that $p(1 - p) \le 1/4$ for $0 \le p \le 1$. In terms of coin-tossing with $p$ as the
probability of a head, $S_n/n$ represents the relative frequency of heads in $n$
independent tosses; cf. Example 3 of §2.1. This is of course a random number
varying from one experiment to another. It will clarify things if we reinstate
the long-absent sample point $\omega$ and write explicitly:

$\frac{S_n(\omega)}{n}$ = relative frequency of heads in $n$ tosses associated
with the experiment denoted by $\omega$.


Thus each $\omega$ is conceived as the record of the outcomes of a sequence of
coin-tossing called briefly an experiment, so that different $\omega$'s correspond to
different experiments. The sequence of tossing should be considered as infinite
[indefinite, unending] in order to allow arbitrarily large values of $n$ in
computing the frequency [but we are not necessarily assuming the existence of a
limiting frequency; see below]. Symbolically, $\omega$ may be regarded as a concise
representation of the sequence of successive outcomes:

$\omega = \{X_1(\omega), X_2(\omega), \ldots, X_n(\omega), \ldots\},$


see the discussion in §4.1. We can talk about the probability of certain sets of
$\omega$; indeed we have written the probability of the set

$A_n(c) = \left\{\omega : \left|\frac{S_n(\omega)}{n} - p\right| \ge c\right\}$

in (7.6.12); see also e.g. (7.3.10) and (7.3.19) where similar probabilities are
evaluated. In the precise form given in (7.6.12), if the value $c$ as well as any
given positive number $\epsilon$ is assigned, we can determine how large $n$ need be to
make $P(A_n(c)) \le \epsilon$, namely:

(7.6.13)  $\frac{1}{4c^2 n} \le \epsilon$  or  $n \ge \frac{1}{4c^2 \epsilon}.$


This bound for $n$ is larger than necessary because Chebyshev's inequality is a
crude estimate in general. A sharper bound can be culled from Theorem 6
as follows. Rewrite the probability in (7.6.12) and use the approximation
given in (7.3.19) simplified by (7.4.2):

(7.6.14)  $P\left(\left|\frac{S_n - np}{\sqrt{npq}}\right| \ge \sqrt{\frac{n}{pq}}\,c\right) \approx 2\left[1 - \Phi\left(\sqrt{\frac{n}{pq}}\,c\right)\right].$

This is not an asymptotic relation as it pretends to be, because the constants
$a$ and $b$ must be fixed in Theorem 6, whereas here $\pm c\sqrt{n}/\sqrt{pq}$ vary with $n$
(and increase too rapidly with $n$). It can be shown that the difference of the
two sides above is of the form $A/\sqrt{n}$ where $A$ is a numerical constant
depending on $p$, and this error term should not be ignored. But we do so below.

Now put

$\eta = \sqrt{\frac{n}{pq}}\,c;$

our problem is to find the value of $\eta$ to make

(7.6.15)  $2[1 - \Phi(\eta)] \le \epsilon$  or  $\Phi(\eta) \ge 1 - \frac{\epsilon}{2};$

then solve for $n$ from $\eta$. This can be done by looking up a table of values of
$\Phi$; a short one is appended at the end of this book.


Example 8. Suppose $c = 2\%$ and $\epsilon = 5\%$. Then (7.6.15) becomes

$\Phi(\eta) \ge 1 - \frac{5}{200} = .975.$

From the table we see that this is satisfied if $\eta \ge 1.96$. Thus

$n \ge \frac{(1.96)^2 pq}{c^2} = \frac{(1.96)^2 \cdot 10000}{4}\,pq \approx 10000\,pq.$

The last term depends on $p$, but $p(1 - p) \le 1/4$ for all $p$, as already noted,
and so $n \ge 10000 \cdot \frac{1}{4} = 2500$ will do. For comparison, the bound given in
(7.6.13) requires $n \ge 12500$; but that estimate has been rigorously established
whereas the normal approximation is a rough-and-dirty one. We conclude
that if the coin is tossed more than 2500 times, then we can be 95% sure that
the relative frequency of heads computed from the actual experiment will differ
from the true $p$ by no more than 2%.
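
The two bounds of Example 8 can be reproduced side by side. The sketch below is only illustrative: the function names are hypothetical, the standard-library inverse normal distribution replaces the table lookup, and the error term of the normal approximation is ignored, as in the text.

```python
from statistics import NormalDist


def chebyshev_n(c: float, eps: float) -> float:
    """Bound (7.6.13): n >= 1 / (4 c^2 eps)."""
    return 1 / (4 * c * c * eps)


def normal_n(c: float, eps: float, pq: float = 0.25) -> float:
    """Bound from (7.6.15): n >= eta^2 pq / c^2 with Phi(eta) = 1 - eps/2."""
    eta = NormalDist().inv_cdf(1 - eps / 2)
    return eta ** 2 * pq / (c * c)


cheb = chebyshev_n(c=0.02, eps=0.05)
norm = normal_n(c=0.02, eps=0.05)  # worst case pq = 1/4
print(cheb, norm)
```

The normal-approximation value comes out a little under the rounded figure 2500 used in the text, because the text rounds $(1.96)^2 \cdot 10000/4$ up to 10000.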

Such a result can be applied in two ways (both envisioned by Bernoulli): 
(i) if we consider p as known then we can make a prediction on the outcome 
of the experiment; (ii) if we regard p as unknown then we can make an esti- 
mate of its value by performing an actual experiment. The second application 
has been called a problem of “inverse probability” and is the origin of the 
so-called Monte-Carlo method. Here is a numerical example. In an actual 
experiment 10000 tosses were made and the total number of heads obtained 
is 4979; see [Feller 1; p. 21] for details. The computation above shows that
we can be 95% sure that

$\left|p - \frac{4979}{10000}\right| \le \frac{2}{100}$  or  $.4779 \le p \le .5179.$



Returning to the general situation in Theorem 10, we will state the law of
large numbers in the following form reminiscent of the definition of an
ordinary limit. For any $\epsilon > 0$, there exists an $n_0(\epsilon)$ such that for all $n \ge n_0(\epsilon)$ we
have

(7.6.16)  $P\left(\left|\frac{S_n}{n} - m\right| < \epsilon\right) > 1 - \epsilon.$



We have taken both $c$ and $\delta$ in (7.6.4) to be $\epsilon$ without loss of generality [see
Exercise 22 below]. If we interpret this as in the preceding example as an
assertion concerning the proximity of the theoretical mean $m$ to the empirical
average $S_n/n$, the double hedge [margin of error] implied by the two $\epsilon$'s in
(7.6.16) seems inevitable. For in any experiment one can neither be 100% sure 
nor 100% accurate, otherwise the phenomenon would not be a random one. 
Nevertheless mathematicians are idealists and long for perfection. What can- 
not be realized in the empirical world may be achieved in a purely mathe- 
matical scheme. Such a possibility was uncovered by Borel in 1909, who 
created a new chapter in probability by his discovery described below. In the 
Bernoullian case, his famous result may be stated as follows: 


(7.6.17)  $P\left(\lim_{n\to\infty} \frac{S_n}{n} = p\right) = 1.$†



This is known as a “‘strong law of large numbers,” which is an essential 
improvement on Bernoulli’s ““weak law of large numbers.” It asserts the 
existence of a limiting frequency equal to the theoretical probability p, for all 
sample points w except possibly a set of probability zero (but not necessarily 
an empty set). Thus the limit in (2.1.10) indeed exists, but only for almost all 


† For a discussion of Borel's theorem and related topics, see Chapter 5 of Chung [1].

$\omega$, so that the empirical theory of frequencies beloved by the applied scientist
is justifiable through a sophisticated theorem. The difference between this
and Bernoulli's weaker theorem:

$\forall \epsilon > 0: \lim_{n\to\infty} P\left(\left|\frac{S_n}{n} - p\right| < \epsilon\right) = 1,$

is subtle and cannot be adequately explained without measure theory. The 
astute reader may observe that although we claim 100% certainty and accu- 
racy in (7.6.17), the limiting frequency is not an empirically observable thing— 
so that the cynic might say that what we are sure of is only an ideal, whereas 
the sophist could retort that we shall never be caught wanting! Even so a 
probabilistic certainty does not mean absolute certainty in the deterministic 
sense. There is an analogue of this distinction in the Second Law of Thermo- 
dynamics (which comes from statistical mechanics). According to that law, 
e.g., when a hot body is in contact with a cold body, it is logically possible 
that heat will flow from the cold to the hot, but the probability of this
happening is zero. A similar exception is permitted in Borel's theorem. For instance,
if a coin is tossed indefinitely, it is logically possible that it's heads every
single time. Such an event constitutes an exception to the assertion in (7.6.17),
but its probability is equal to $\lim_{n\to\infty} p^n = 0$.
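
For a feel of what (7.6.17) describes, one can follow the relative frequency along a single simulated sequence of tosses. This is a minimal sketch under stated assumptions: the function name `running_frequencies`, the seed and the number of tosses are arbitrary choices, and the pseudo-random generator stands in for the coin. A finite simulation cannot verify an almost-sure statement; it only illustrates one sample point $\omega$.

```python
import random


def running_frequencies(p: float, n_tosses: int, seed: int) -> list:
    """Relative frequencies S_n/n, n = 1..n_tosses, along one simulated omega."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for n in range(1, n_tosses + 1):
        heads += rng.random() < p  # adds 1 for a head, 0 for a tail
        freqs.append(heads / n)
    return freqs


freqs = running_frequencies(p=0.5, n_tosses=100_000, seed=0)
for n in (10, 1_000, 100_000):
    print(n, freqs[n - 1])
```

Changing the seed corresponds to choosing a different $\omega$, that is, a different experiment.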

The strong law of large numbers is the foundation of a mathematical 
theory of probability based on the concept of frequency; see §2.1. It makes 
better sense than the weak one and is indispensable for certain theoretical 
investigations. [In statistical mechanics it is known in an extended form under 
the name Ergodic Theorem.] But the dyed-in-the-wool empiricist, as well as a 
radical school of logicians called intuitionists, may regard it as an idealistic 
fiction. It is amusing to quote two eminent authors on the subject: 


Feller: “[the weak law of large numbers] is of very limited interest and 
should be replaced by the more precise and more useful strong law of 
large numbers” (p. 152 of [Feller 1]). 


van der Waerden: “[the strong law of large numbers] scarcely plays a role
in mathematical statistics” (p. 98 of Mathematische Statistik, 3rd ed.,
Springer-Verlag, 1971). 


Let us end this discussion by keeping in mind the gap between observable 
phenomena in the real world and the theoretical models used to study them; 
see Einstein’s remark on p. 123. The law of large numbers, weak or strong, 
is a mathematical theorem deduced from axioms. Its applicability to true-life 
experiences such as the tossing of a penny or nickel is necessarily limited and 
imperfect. The various examples given above to interpret and illustrate the 
theorems should be viewed with this basic understanding. 


Exercises 


1. Suppose that a book of 300 pages contains 200 misprints. Use Poisson
approximation to write down the probability that there is more than
one misprint on a particular page.

2. In a school where 4% of the children write with their left hands, what is
the probability that there are no left-handed children in a class of 25?

3. Six dice are thrown 200 times by the players. Estimate the probability
of obtaining "six different faces" $k$ times, where $k = 0, 1, 2, 3, 4, 5$.

4. A home bakery made 100 loaves of raisin bread using 2000 raisins.
Write down the probability that the loaf you bought contains 20 to 30
raisins.

5. It is estimated that on a certain island of 15 square miles there are 20
giant tortoises of one species and 30 of another species left. An ecological
survey team spotted 2 of them in an area of 1 square mile, but neglected
to record which species. Use Poisson distribution to find the probabilities
of the various possibilities.

6. Find the maximum term or terms in the binomial distribution $B_k(n; p)$,
$0 \le k \le n$. Show that the terms increase up to the maximum and then
decrease. [Hint: take ratios of consecutive terms.]

7. Find the maximum term or terms in the Poisson distribution $\pi_k(\alpha)$,
$0 \le k < \infty$. Show the same behavior of the terms as in No. 6.

8. Let $X$ be a random variable such that $P(X = c + kh) = \pi_k(\alpha)$ where $c$ is
a real and $h$ is a positive number. Find the Laplace transform of $X$.

9. Find the convolution of two sequences given by Poisson distributions
$\{\pi_k(\alpha)\}$ and $\{\pi_k(\beta)\}$.

10.* If $X_\alpha$ has the Poisson distribution $\pi(\alpha)$, then

$\lim_{\alpha\to\infty} P\left(\frac{X_\alpha - \alpha}{\sqrt{\alpha}} \le u\right) = \Phi(u)$

for every $u$. [Hint: use the Laplace transform $E(e^{-\lambda(X_\alpha - \alpha)/\sqrt{\alpha}})$, show that
as $\alpha \to \infty$ it converges to $e^{\lambda^2/2}$, and invoke the analogue of Theorem 9
of §7.5.]

11. Assume that the distance between cars going in one direction on a certain
highway is exponentially distributed with mean value 100 meters.
What is the probability that in a stretch of 5 kilometers there are between
50 to 60 cars?

12. On a certain highway the flow of traffic may be assumed to be Poissonian
with intensity equal to 30 cars per minute. Write down the probability
that it takes more than $N$ seconds for $n$ consecutive cars to pass by an
observation post. [Hint: use (7.2.11).]

13. A perfect die is rolled 100 times. Find the probability that the sum of
all points obtained is between 330 and 380.

14. It is desired to find the probability $p$ that a certain thumb tack will fall
on its flat head when tossed. How many trials are needed in order that
we may be 95% sure that the observed relative frequency differs from
$p$ by less than $p/10$? [Hint: try it a number of times to get a rough
bound for $p$.]

15. Two movie theatres compete for 1000 customers. Suppose that each
customer chooses one of the two with "total indifference" and independently
of other customers. How many seats should each theatre have so
that the probability of turning away any customer for lack of seats is
less than 1%?

16. A sufficient number of voters are polled to determine the percentage in
favor of a certain candidate. Assume that an unknown proportion $p$ of
the voters favor him and they act independently of one another; how
many should be polled to predict the value of $p$ within 4.5% with 95%
confidence? [This is the so-called "four percent margin of error" in
predicting elections, presumably because <.045 becomes ≤.04 by the rule
of rounding decimals.]

17. Write $\Phi((a, b))$ for $\Phi(b) - \Phi(a)$ where $a < b$ and $\Phi$ is the unit normal
distribution. Show that $\Phi((0, 2)) > \Phi((1, 3))$ and generalize to any two
intervals of the same length. [Hint: $e^{-x^2/2}$ decreases as $|x|$ increases.]
18. Complete the proof of (7.1.8) and then use the same method to prove
(7.1.12). [Hint: $|\log(1 - x) + x| \le \frac{1}{2}\sum_{n=2}^{\infty} |x|^n = \frac{x^2}{2(1 - |x|)}$; hence if
$|x| \le \frac{1}{2}$ this is bounded by $x^2$.]

19. Prove (7.1.13).

20. Prove Chebyshev's inequality when $X$ has a density. [Hint: $\sigma^2(X) = \int_{-\infty}^{\infty} (x - m)^2 f(x)\,dx \ge \int_{|x - m| > c} (x - m)^2 f(x)\,dx$.]

21. Prove the following analogue of Chebyshev's inequality where the absolute
first moment is used in place of the second moment:

$P(|X - m| > c) \le \frac{1}{c} E(|X - m|).$

22.* Show that $\lim_{n\to\infty} P(|X_n| > \epsilon) = 0$ for every $\epsilon$ if and only if given any $\epsilon$,
there exists $n_0(\epsilon)$ such that

$P(|X_n| > \epsilon) < \epsilon$ for $n > n_0(\epsilon)$.

This is also equivalent to: given any $\delta$ and $\epsilon$, there exists $n_0(\delta, \epsilon)$ such that

$P(|X_n| > \epsilon) < \delta$ for $n > n_0(\delta, \epsilon)$.

[Hint: consider $\epsilon' = \delta \wedge \epsilon$ and apply the first form.]

23. If $X$ has the distribution $\Phi$, show that $|X|$ has the distribution $\Psi$, where
$\Psi = 2\Phi - 1$; $\Psi$ is called the "positive normal distribution."

24. If $X$ has the distribution $\Phi$, find the density function of $X^2$ and the
corresponding distribution. This is known as the "chi-square distribution"
in statistics. [Hint: differentiate $P(X^2 < x) = \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{x}} e^{-u^2/2}\,du$.]

25.* Use No. 24 to show that

$\int_0^{\infty} x^{-1/2} e^{-x}\,dx = \sqrt{\pi}.$

The integral is equal to $\Gamma\left(\frac{1}{2}\right)$ where $\Gamma$ is the gamma function defined by
$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx$ for $\alpha > 0$. [Hint: consider $E(X^2)$ in No. 24.]

26.* Let $\{\xi_k, 1 \le k \le n\}$ be $n$ random variables satisfying $0 < \xi_1 \le \xi_2 \le \cdots \le \xi_n \le t$;
let $(0, t] = \bigcup_{k=1}^{l} I_k$ be an arbitrary partition of $(0, t]$ into
subintervals $I_k = (x_{k-1}, x_k]$ where $x_0 = 0$; and $N(I_k)$ denote the number
of $\xi$'s belonging to $I_k$. How can we express the event $\{\xi_k \le x_k; 1 \le k \le l\}$
by means of $N(I_k)$, $1 \le k \le l$? Here of course $0 < x_1 < x_2 < \cdots < x_n \le t$.
Now suppose that $x_k$, $1 \le k \le l$, are arbitrary and answer the
question again. [Hint: try $n = 2$ and 3 to see what is going on; relabel
the $x_k$ in the second part.]

27.* Let $\{X(t), t \ge 0\}$ be a Poisson process with parameter $\alpha$. For a fixed
$t > 0$ define $\delta(t)$ to be the distance from $t$ to the last jump before $t$ if
there is one, and to be $t$ otherwise. Define $\delta'(t)$ to be the distance from $t$
to the next jump after $t$. Find the distributions of $\delta(t)$ and $\delta'(t)$. [Hint:
if $u < t$, $P\{\delta(t) > u\} = P\{N(t - u, t) = 0\}$; for all $u > 0$, $P\{\delta'(t) > u\} = P\{N(t, t + u) = 0\}$.]

28. Let $\tau(t) = \delta(t) + \delta'(t)$ as in No. 27. This is the length of the between-jump
interval containing the given time $t$. For each $\omega$, this is one of the
random variables $T_k$ described in §7.2. Does $\tau(t)$ have the same exponential
distribution as all the $T_k$'s? [This is a nice example where logic
must take precedence over "intuition," and is often referred to as a
paradox. The answer should be easy from No. 26. For further discussion
at a level slightly more advanced than this book, see Chung, "The
Poisson process as renewal process," Periodica Mathematica Hungarica,
Vol. 2 (1972), pp. 41-48.]

29. Use Chebyshev's inequality to show that if $X$ and $Y$ are two arbitrary
random variables satisfying $E\{(X - Y)^2\} = 0$, then we have
$P(X = Y) = 1$, namely $X$ and $Y$ are almost surely identical. [Hint:
$P(|X - Y| > \epsilon) = 0$ for any $\epsilon > 0$.]

30. Recall the coefficient of correlation $\rho(X, Y)$ from p. 169. Show that if
$\rho(X, Y) = 1$, then the two "normalized" random variables:

$\tilde{X} = \frac{X - E(X)}{\sigma(X)}, \quad \tilde{Y} = \frac{Y - E(Y)}{\sigma(Y)}$

are almost surely identical. What if $\rho(X, Y) = -1$? [Hint: compute
$E\{(\tilde{X} - \tilde{Y})^2\}$ and use No. 29.]


Appendix 2 


Stirling’s Formula and De Moivre-Laplace’s Theorem 


In this appendix we complete some details in the proof of Theorem 5, es- 
tablish Stirling’s formula (7.3.3) and relate it to the normal integral (7.4.1). 
We begin with an estimate. 


Lemma. If $|x| \le 2/3$, then

$\log(1 + x) = x - \frac{x^2}{2} + \theta(x)$

where $|\theta(x)| \le |x|^3$.

Proof: We have by Taylor's series for $\log(1 + x)$:

$\log(1 + x) = x - \frac{x^2}{2} + \sum_{n=3}^{\infty} (-1)^{n-1} \frac{x^n}{n}.$

Hence $\theta(x)$ is equal to the series above and

$|\theta(x)| \le \sum_{n=3}^{\infty} \frac{|x|^n}{n} \le \frac{1}{3} \sum_{n=3}^{\infty} |x|^n = \frac{|x|^3}{3(1 - |x|)}.$

For $|x| \le 2/3$, $3(1 - |x|) \ge 1$ and the lemma follows. The choice of the
constant 2/3 is a matter of convenience; a similar estimate holds for any constant
$< 1$.
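
The estimate in the Lemma is easy to spot-check numerically. The following sketch evaluates $|\theta(x)|/|x|^3$ on a grid in $[-2/3, 2/3]$; the grid and the name `theta` for the remainder are illustrative choices, and floating-point evaluation is of course not a proof.

```python
import math


def theta(x: float) -> float:
    """Remainder theta(x) = log(1 + x) - x + x^2/2 from the Lemma."""
    return math.log1p(x) - x + x * x / 2


# Grid of nonzero points in [-2/3, 2/3].
xs = [k / 300 for k in range(-200, 201) if k != 0]
worst_ratio = max(abs(theta(x)) / abs(x) ** 3 for x in xs)
print(worst_ratio)  # the Lemma asserts this never exceeds 1
```

The ratio is largest at the left endpoint $x = -2/3$, where the series for $\log(1 + x)$ has all terms of the same sign.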

We will use the Lemma first to complete the proof of Theorem 5, by showing
that the omitted terms in the two series expansions in (7.3.17) may indeed
be ignored as $n \to \infty$. When $n$ is sufficiently large the two quantities in
(7.3.17') will be $\le 2/3$. Consequently the Lemma is applicable and the
contribution from the "tails" of the two series, represented by dots there, is bounded
by

$k \left|\frac{\sqrt{npq}\,x_k}{k}\right|^3 + (n - k) \left|\frac{\sqrt{npq}\,x_k}{n - k}\right|^3.$

Since $pq < 1$ and $|x_k| \le A$, this does not exceed

$\frac{n^{3/2}}{k^2} A^3 + \frac{n^{3/2}}{(n - k)^2} A^3,$

which clearly tends to zero as $n \to \infty$, by (7.3.15). Therefore the tails vanish
in the limit and we are led to (7.3.18) as shown there.

Next we shall prove, as a major step toward Stirling's formula, the
relation below:


(A.2.1)  $\lim_{n\to\infty} \left\{\log n! - \left(n + \frac{1}{2}\right) \log n + n\right\} = C$

where $C$ is a constant to be determined later. Let $d_n$ denote the quantity
between the braces in (A.2.1), then a simple computation gives

$d_n - d_{n+1} = \left(n + \frac{1}{2}\right) \log\left(1 + \frac{1}{n}\right) - 1.$

Using the notation in the Lemma, we write this as

$\left(n + \frac{1}{2}\right) \left(\frac{1}{n} - \frac{1}{2n^2} + \theta\left(\frac{1}{n}\right)\right) - 1 = \left(n + \frac{1}{2}\right) \theta\left(\frac{1}{n}\right) - \frac{1}{4n^2}$

and consequently by the Lemma with $x = 1/n$, $n \ge 2$:

$|d_n - d_{n+1}| \le \left(n + \frac{1}{2}\right) \frac{1}{n^3} + \frac{1}{4n^2} = \frac{2n + 1}{2n^3} + \frac{1}{4n^2}.$

Therefore the series $\sum_n |d_n - d_{n+1}|$ converges by the comparison test. Now
recall that an absolutely convergent series is convergent, which means the
partial sum tends to a finite limit, say $C_1$. Thus we have

$\lim_{N\to\infty} \sum_{n=1}^{N} (d_n - d_{n+1}) = C_1;$

but the sum above telescopes into $d_1 - d_{N+1}$, and so

$\lim_{N\to\infty} d_{N+1} = d_1 - C_1,$

and we have proved the assertion in (A.2.1) with $C = d_1 - C_1$. It follows that

$\lim_{n\to\infty} \frac{n!\,e^n}{n^{n+(1/2)}} = e^C,$

or if $K = e^C$:

(A.2.2)  $n! \sim K n^{n+(1/2)} e^{-n}.$


If we compare this with (7.3.3) we see that it remains to prove that $K = \sqrt{2\pi}$
to obtain Stirling's formula. But observe that even without this evaluation
of the constant $K$, the calculations in Theorems 5 and 6 of §7.3 are valid
provided we replace $\sqrt{2\pi}$ by $K$ everywhere. In particular, formula (7.3.19)
with $a = -b$ becomes

(A.2.3)  $\lim_{n\to\infty} P\left(\left|\frac{S_n - np}{\sqrt{npq}}\right| \le b\right) = \frac{1}{K} \int_{-b}^{b} e^{-x^2/2}\,dx.$

On the other hand, we may apply Theorem 11 (Chebyshev's inequality) with
$X = (S_n - np)/\sqrt{npq}$, $E(X) = 0$, $E(X^2) = 1$, to obtain the inequality:

(A.2.4)  $P\left(\left|\frac{S_n - np}{\sqrt{npq}}\right| \le b\right) \ge 1 - \frac{1}{b^2}.$

Combining the last two relations and remembering that a probability cannot
exceed one, we obtain

$1 - \frac{1}{b^2} \le \frac{1}{K} \int_{-b}^{b} e^{-x^2/2}\,dx \le 1.$

Letting $b \to \infty$ we conclude that

(A.2.5)  $K = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx.$

Since the integral above has the value $\sqrt{2\pi}$ by (7.4.1), we have proved that
$K = \sqrt{2\pi}$.
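
The convergence in (A.2.1) and the value of $K$ can be observed numerically. The sketch below computes $n!\,e^n / n^{n+(1/2)}$, that is $e^{d_n}$ in the notation above, through logarithms to avoid overflow; the function name `stirling_ratio` is a hypothetical helper, and the standard-library `math.lgamma` supplies $\log n!$.

```python
import math


def stirling_ratio(n: int) -> float:
    """n! * e^n / n^(n + 1/2), i.e. exp(d_n), computed via logarithms."""
    log_ratio = math.lgamma(n + 1) - (n + 0.5) * math.log(n) + n
    return math.exp(log_ratio)


for n in (1, 10, 100, 1000):
    print(n, stirling_ratio(n))
print("sqrt(2*pi) =", math.sqrt(2 * math.pi))
```

The printed ratios decrease toward $\sqrt{2\pi}$, which is consistent with the terms $d_n - d_{n+1}$ being small and summable.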

Another way of evaluating $K$ is via Wallis's product formula given in
many calculus texts (see e.g., Courant-John, Introduction to calculus and
analysis, Vol. 1, New York: Interscience Publishers, 1965). If this is done then
the argument above gives (A.2.5) with $K = \sqrt{2\pi}$, so that the formula for the
normal integral (7.4.1) follows. This justifies the heuristic argument
tioned under (7.4.1), and shows the intimate relation between the two results 
named in the title of this appendix. 


Chapter 8 
From Random Walks to Markov Chains 


8.1. Problems of the wanderer or gambler 


The simplest random walk may be described as follows. A particle moves 
along a line by steps; each step takes it one unit to the right or to the left with 
probabilities p and g = 1 — p respectively where 0 < p < 1. For verbal 
convenience we suppose that each step is taken in a unit of time so that the 
nth step is made instantaneously at time n; furthermore we suppose that the 
possible positions of the particle are the set of all integers on the coordinate 
axis. This set is often referred to as the "integer lattice" on $R^1 = (-\infty, \infty)$
and will be denoted by $I$. Thus the particle executes a walk on the lattice,
back and forth, and continues ad infinitum. If we plot its position X, as a 
function of the time n, its path is a zigzag line of which some samples are 
shown below in Figure 30. 

A more picturesque language turns the particle into a wanderer or drunk- 
ard and the line into an endless street divided into blocks. In each unit of 
time, say 5 minutes, he walks one block from street corner to corner, and at 
each corner he may choose to go ahead or turn back with probabilities p or q. 
He is then taking a random walk and his track may be traced on the street 
with a lot of doubling and re-doubling. This language suggests an immediate 
extension to a more realistic model where there are vertical as well as hori- 
zontal streets, regularly spaced as in parts of New York City. In this case 
each step may take one of the four possible directions as in Figure 31. 
This scheme corresponds to a random walk on the integer lattice of the plane 
$R^2$. We shall occasionally return to this below, but for the most part confine
our discussion to the simplest situation of one dimension. 

A mathematical formulation is near at hand. Let $\xi_n$ be the $n$th step taken
or displacement, so that

(8.1.1)  $\xi_n = \begin{cases} +1 & \text{with probability } p, \\ -1 & \text{with probability } q; \end{cases}$

and the $\xi_n$'s are independent random variables. If we denote the initial
position by $X_0$, then the position at time $n$ (or after $n$ steps) is just

(8.1.2)  $X_n = X_0 + \xi_1 + \cdots + \xi_n.$

Thus the random walk is represented by the sequence of random variables
$\{X_n, n \ge 0\}$ which is a stochastic process in discrete time. In fact, $X_n - X_0$
is a sum of independent Bernoullian random variables much studied in
Chapters 5, 6 and 7. We have changed our previous notation (see e.g. (6.3.11))
to the present one in (8.1.2) to conform with later usage in §8.3. But apart
from this what is new here?

Figure 30

The answer is that our point of view will be new. We are going to study 
the entire walk, or process, as it proceeds, or develops, in the course of time. 
In other words, each path of the particle or wanderer will be envisioned as a 
possible development of the process subject to the probability laws im- 
posed on the motion. Previously we have been interested mostly in certain 
quantitative characteristics of $X_n$ (formerly $S_n$) such as its mean, variance
and distribution. Although the subscript $n$ there is arbitrary and varies when
$n \to \infty$, a probability like $P(a \le X_n \le b)$ concerns only the variable $X_n$
taken one at a time, so to speak. Now we are going to probe deeper into the
structure of the sequence $\{X_n, n \ge 0\}$ by asking questions which involve
many of them all at once. Here are some examples. Will the moving particle
ever "hit" a given point? If so, how long will this take, and will it happen
before or after the particle hits some other point? One may also ask how
frequently the particle hits a point or a set; how long it stays within a set, etc.
Some of these questions will be made precise below and answered. In the
meantime you should let your fancy go free and think up a few more such
questions, and perhaps relate them to concrete models of practical
significance.

Figure 31

Let us begin with the following problem.


Problem 1. Consider the interval $[0, c]$ where $c = a + b$ and $a \ge 1$, $b \ge 1$.
If the particle starts at the point "$a$" what is the probability that it will hit
one endpoint of the interval before the other?

This is a famous problem in another setting, discussed by Fermat and 
Pascal and solved in general by Montmort. Two gamblers Peter and Paul
play a series of games in which Peter wins with probability p and Paul wins 
with probability g, and the outcomes of the successive games are assumed to 
be independent. For instance they may toss a coin repeatedly or play ping- 
pong or chess in which their skills are rated as p to g. The loser pays a dollar 
each time to the winner. Now if Peter has \$$a$ and Paul has \$$b$ at the outset
and they continue to play until one of them is ruined (bankrupt), what is the 
probability that Peter will be ruined? 

In this formulation the position of the particle at any time $n$ becomes the
number of dollars Peter has after n games. Each step to the right is $1 won 
by him, each step to the left is $1 lost. If the particle reaches 0 before c, then 
Peter has lost all his initial capital and is ruined; on the other hand if the 
particle reaches c before 0, then Paul has lost all his capital and is ruined. 
The game terminates when one of these eventualities occurs. Hence the
historical name of "gambler's ruin problem."

We are now going to solve Problem 1. The solution depends on the
following smart "put," for $1 \le j \le c - 1$:

(8.1.3)  $u_j$ = the probability that the particle will reach 0
         before $c$, when it starts from $j$.


The problem is to find $u_a$, but since "$a$" is arbitrary we really need all the $u_j$'s.
Indeed the idea is to exploit the relations between them and trap them
together. These relations are given by the following set of difference equations:

(8.1.4)  $u_j = p u_{j+1} + q u_{j-1}, \quad 1 \le j \le c - 1,$

together with the boundary conditions:

(8.1.5)  $u_0 = 1, \quad u_c = 0.$


To argue (8.1.4), think of the particle as being at $j$ and consider what will
happen after taking one step. With probability $p$ it will then be at $j + 1$,
under which hypothesis the (conditional) probability of reaching 0 before $c$
will be $u_{j+1}$; similarly with probability $q$ it will be at $j - 1$, under which
hypothesis the said probability will be $u_{j-1}$. Hence the total probability $u_j$ is
equal to the sum of the two terms on the right side of (8.1.4), by an
application of Proposition 2 of §5.2. This argument spelled out in the extremal
cases $j = 1$ and $j = c - 1$ entails the values of $u_0$ and $u_c$ given in (8.1.5).
These are not included in (8.1.3) and strictly speaking are not well-defined
by the verbal description given there, although it makes sense by a kind of
extrapolation.

The rest of our work is purely algebraic. Since p + q = 1 we may write 
the left member of (8.1.4) as \(p\,u_j + q\,u_j\); after a transposition the equation
becomes

          \[ q(u_j - u_{j-1}) = p(u_{j+1} - u_j). \]

Using the abbreviations

          \[ r = \frac{q}{p}, \qquad d_j = u_j - u_{j+1}, \]

we obtain the basic recursion between successive differences below:

(8.1.6)   \[ d_j = r\,d_{j-1}. \]


Iterating we get \(d_j = r^j d_0\); then summing by telescoping:

(8.1.7)   \[ 1 = u_0 - u_c = \sum_{j=0}^{c-1} (u_j - u_{j+1}) = \sum_{j=0}^{c-1} d_j = \sum_{j=0}^{c-1} r^j d_0 = \frac{1 - r^c}{1 - r}\, d_0, \]

provided that \(r \ne 1\). Next we have similarly

(8.1.8)   \[ u_j = u_j - u_c = \sum_{i=j}^{c-1} (u_i - u_{i+1}) = \sum_{i=j}^{c-1} d_i = \sum_{i=j}^{c-1} r^i d_0 = \frac{r^j - r^c}{1 - r}\, d_0. \]

It follows that

(8.1.9)   \[ u_j = \frac{r^j - r^c}{1 - r^c}, \quad 0 \le j \le c. \]

In case r = 1 we get from the penultimate terms in (8.1.7) and (8.1.8) that

(8.1.10)  \[ 1 = c\,d_0, \qquad u_j = (c - j)\,d_0 = \frac{c - j}{c}; \qquad u_a = \frac{b}{c}. \]
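As a check on the algebra, the closed forms (8.1.9) and (8.1.10) can be substituted back into (8.1.4) and (8.1.5). The following is a minimal sketch in exact rational arithmetic; the function names `ruin_probability` and `satisfies_difference_equations` are our own labels, and it assumes 0 < p < 1.

```python
from fractions import Fraction

def ruin_probability(j, c, p):
    """Closed form (8.1.9)/(8.1.10): chance of reaching 0 before c from j.

    Assumes 0 < p < 1 (p = 0 would divide by zero in r = q/p).
    """
    p = Fraction(p)
    q = 1 - p
    if p == q:                      # the case r = 1, formula (8.1.10)
        return Fraction(c - j, c)
    r = q / p
    return (r**j - r**c) / (1 - r**c)

def satisfies_difference_equations(c, p):
    """Check (8.1.4) and (8.1.5) exactly at every point of [0, c]."""
    p = Fraction(p)
    q = 1 - p
    u = [ruin_probability(j, c, p) for j in range(c + 1)]
    boundary_ok = (u[0] == 1 and u[c] == 0)
    interior_ok = all(u[j] == p * u[j + 1] + q * u[j - 1] for j in range(1, c))
    return boundary_ok and interior_ok

print(ruin_probability(3, 10, Fraction(1, 2)))             # 7/10
print(satisfies_difference_equations(10, Fraction(2, 5)))  # True
```

Exact fractions are used so that the equalities can be tested with `==` instead of a floating-point tolerance.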


One half of Problem 1 has been completely solved; it remains to find


          \(v_j\) = the probability that the particle will reach
          c before 0, when it starts from j.


Exactly the same argument shows that the set of equations in (8.1.4) will be 
valid when the u’s are replaced by v’s, while the boundary conditions in 
(8.1.5) are merely interchanged: \(v_0 = 0\), \(v_c = 1\). Hence we can find all \(v_j\) by 
a similar method, which you may wish to carry out as an excellent exercise. 
However, there are quicker ways without this effort. 

One way is perhaps easier to understand by thinking in terms of the 
gamblers. If we change p into q (namely r into 1/r), and at the same time j
into c - j (because when Peter has $j, Paul has $(c - j), and vice versa),
then their roles are interchanged and so \(u_j\) will go over into \(v_j\) (not \(v_{c-j}\), why?).
Making these changes in (8.1.9) and (8.1.10), we obtain

          \[ v_j = \frac{1 - r^j}{1 - r^c} \quad \text{if } p \ne q; \qquad v_j = \frac{j}{c} \quad \text{if } p = q. \]


Now it is a real pleasure to see that in both cases we have 


(8.1.11)  \[ u_j + v_j = 1, \quad 0 \le j \le c. \]

Thus as a by-product, we have solved the next problem which may have 
occurred to you in the course of the preceding discussion. 


Problem 2. If the particle starts inside the interval [0, c], what is the prob- 
ability that it will ever reach the boundary? 

Since the boundary consists of the two endpoints 0 and c, the answer is 
given by (8.1.11) and is equal to one. In terms of the gamblers, this means 
that one of them is bound to be ruined sooner or later if the game is con- 
tinued without a time limit; in other words it cannot go on forever. Now 
you can object that surely it is conceivable for Peter and Paul to seesaw endlessly
as e.g. indicated by the sequence +1 -1 +1 -1 +1 -1 ....
The explanation is that while this eventuality is a logical possibility its probability
is equal to zero as just shown. Namely, it will almost never happen in 
the sense discussed at the end of §7.6, and this is all we can assert. 

Next, let us mention that Problem 2 can be solved without the interven- 
tion of Problem 1. Indeed, it is clear the question raised in Problem 2 is a 
more broad “qualitative” one which should not depend on the specific 
numerical answers demanded by Problem 1. It is not hard to show that even 
if the ξ’s in (8.1.2) are replaced by independent random variables with an 
arbitrary common distribution, which are not identically zero, so that we have 
a generalized random walk with all kinds of possible steps, the answer to 
Problem 2 is still the same in the broader sense that the particle will sooner 
or later get out of any finite interval (see e.g. [Chung 1; Theorem 9.2.3]). 
Specializing to the present case where the steps are ±1, we see that the particle
must go through one of the endpoints before it can leave the interval
[0, c]. If this conclusion tantamount to (8.1.11) is accepted with or without
a proof, then of course we get \(v_j = 1 - u_j\) without further calculation. 

Let us state the answer to Problem 2 as follows. 


Theorem 1. For any random walk (with arbitrary p), the particle will almost 
surely† not remain in any finite interval forever. 

As a consequence, we can define a random variable which denotes the waiting
time until the particle reaches the boundary. This is sometimes referred 
to as “absorption time” if the boundary points are regarded as “‘absorbing 
barriers,’ namely the particle is supposed to be stuck there as soon as it hits 
them. In terms of the gamblers, it is also known as the ‘‘duration of play.” 
Let us put for \(1 \le j \le c - 1\):


(8.1.12)  \(S_j\) = the first time when the particle reaches 0 or c,
          starting from j;


and denote its expectation \(E(S_j)\) by \(e_j\). The answer to Problem 2 asserts that
\(S_j\) is almost surely finite, hence it is a random variable taking positive integer
values. [Were it possible for \(S_j\) to be infinite it would not be a random variable
as defined in §4.2, since “+∞” is not a number. However, we shall not
elaborate on the sample space on which \(S_j\) is defined; it is not countable!]

† In general, “almost surely” means “with probability one”.

The various \(e_j\)’s satisfy a set of relations like the \(u_j\)’s, as follows: 


(8.1.13)  \[ e_j = p\,e_{j+1} + q\,e_{j-1} + 1, \quad 1 \le j \le c - 1; \qquad e_0 = 0, \quad e_c = 0. \]

The argument is similar to that for (8.1.4) and (8.1.5), provided we explain
the additional constant “1” on the right side of the first equation above.
This is the unit of time spent in taking the one step involved in the argument
from j to j ± 1. 

The complete solution of (8.1.13) may be carried out directly as before, 
or more expeditiously by falling back on a standard method in solving dif- 
ference equations detailed in Exercise 13 below. Since the general solution is 
not enlightening we will indicate the direct solution only in the case p = q =
1/2, which is needed in later discussion. Let \(f_j = e_j - e_{j+1}\), then

          \[ f_j = f_{j-1} + 2, \qquad f_j = f_0 + 2j, \]
          \[ 0 = \sum_{j=0}^{c-1} f_j = c\,(f_0 + c - 1). \]

Hence \(f_0 = 1 - c\), and after a little computation,

(8.1.14)  \[ e_j = \sum_{i=j}^{c-1} f_i = \sum_{i=j}^{c-1} (1 - c + 2i) = j(c - j). \]


Since the random walk is symmetric, the expected absorption time should be 
the same when the particle is at distance j from 0, or from c (thus at distance 
c - j from 0), hence it is a priori clear that \(e_j = e_{c-j}\), which checks out with 
(8.1.14). 
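For readers who wish to see (8.1.13) solved numerically for any p, here is a minimal sketch. It is not the method of Exercise 13; it uses a simple "shooting" device of our own choosing: since \(e_0 = 0\) is known, guess \(e_1\), run the recursion forward, and adjust the guess so that \(e_c = 0\). The name `expected_duration` is hypothetical, and 0 < p < 1 is assumed.

```python
from fractions import Fraction

def expected_duration(c, p):
    """Solve (8.1.13) exactly; returns the list [e_0, e_1, ..., e_c]."""
    p = Fraction(p)
    q = 1 - p

    def run(e1):
        # Forward recursion from e_0 = 0 and a trial value of e_1.
        e = [Fraction(0), Fraction(e1)]
        for j in range(1, c):
            # (8.1.13) rearranged: e_{j+1} = (e_j - q e_{j-1} - 1) / p
            e.append((e[j] - q * e[j - 1] - 1) / p)
        return e

    # e_c is an affine function of the trial value e_1, so two runs fix it.
    at_zero = run(0)[c]
    at_one = run(1)[c]
    e1 = -at_zero / (at_one - at_zero)
    return run(e1)

print(expected_duration(4, Fraction(1, 2)) == [0, 3, 4, 3, 0])  # True
```

In the symmetric case the output agrees with (8.1.14), and the list is visibly symmetric about c/2.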


8.2. Limiting schemes 


We are now ready to draw important conclusions from the preceding formulas.
First of all, we will convert the interval [0, c] into the half-line [0, ∞)
by letting \(c \to +\infty\). It follows from (8.1.9) and (8.1.10) that

(8.2.1)   \[ \lim_{c \to \infty} u_j = \begin{cases} r^j & \text{if } r < 1, \\ 1 & \text{if } r \ge 1. \end{cases} \]


Intuitively, this limit should mean the probability that the particle will reach 
0 before “it reaches +∞”, starting from j; or else the probability that Peter
will be ruined when he plays against an “infinitely rich” Paul, who cannot be
ruined. Thus it simply represents the probability that the particle will ever
reach 0 from j, or that of Peter’s eventual ruin when his capital is $j. This
interpretation is correct and furnishes the answer to the following problem 
which is a sharpening of Problem 2. 


Problem 3. If the particle starts from a (≥ 1), what is the probability that it 
will ever hit 0? 

The answer is 1 if p ≤ q; and \((q/p)^a\) if p > q. Observe that when p ≤ q 
the particle is at least as likely to go left as to go right, so the first conclusion 
is most plausible. Indeed, in case p < q we can say more by invoking the law 
of large numbers in its strong form given in §7.6. Remembering our new 
notation in (8.1.2) and that E(é,) = p — q, we see that in the present context 
(7.6.17) becomes the assertion that almost surely we have 


          \[ \lim_{n \to \infty} \frac{X_n}{n} = p - q < 0. \]

This is a much stronger assertion than that \(\lim_{n \to \infty} X_n = -\infty\). Now our particle
moves only one unit at a time, hence it can go to \(-\infty\) only by passing through 
all the points to the left of the starting point. In particular it will almost 
surely hit 0 from a. 
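The passage to the limit in (8.2.1) can be watched numerically. The sketch below evaluates \(u_a\) from (8.1.9) and (8.1.10) for growing c and compares it with the stated limit; the function names are ours, floating point is used, and c is kept moderate because \(r^c\) overflows for large c when r > 1.

```python
def ruin_probability(a, c, p):
    """u_a from (8.1.9)/(8.1.10) in floating point; assumes 0 < p < 1."""
    q = 1.0 - p
    if p == q:
        return (c - a) / c
    r = q / p
    return (r**a - r**c) / (1.0 - r**c)

def limiting_ruin_probability(a, p):
    """The limit (8.2.1): 1 if p <= q, and (q/p)^a if p > q."""
    q = 1.0 - p
    return 1.0 if p <= q else (q / p) ** a

for p in (0.4, 0.5, 0.6):
    values = [ruin_probability(3, c, p) for c in (10, 50, 200)]
    print(p, values, limiting_ruin_probability(3, p))
```

Note how slowly the fair case p = 1/2 approaches 1 (the error is a/c), compared with the geometric convergence in the other two cases.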

In case p > q the implication for gambling is curious. If Peter has a definite 
advantage, then even if he has only $1 and is playing against an unruinable 
Paul, he still has a chance 1 — q/p to escape ruin forever. Indeed, it can be 
shown that in this happy event Peter will win big in the following precise 
sense, where \(X_n\) denotes his fortune after n games:

          \[ P\{X_n \to +\infty \mid X_n \ne 0 \text{ for all } n\} = 1. \]

[This is a conditional probability given the event \(\{X_n \ne 0 \text{ for all } n\}\).] Is this
intuitively obvious? Theorem 1 helps the argument here but does not clinch it. 

When p = q = 1/2 the argument above does not apply, and since in this 
case there is symmetry between left and right, our conclusion may be stated 
more forcefully as follows. 


Theorem 2. Starting from any point in a symmetric random walk, the particle 
will almost surely hit any point any number of times. 


Proof: Let us write i ⇒ j to mean that starting from i the particle will almost
surely hit j, where i ∈ I, j ∈ I. We have already proved that if i ≠ j, then
i ⇒ j. Hence also j ⇒ i. But this implies j ⇒ j by the obvious diagram
j ⇒ i ⇒ j. Hence also i ⇒ j ⇒ j ⇒ j ⇒ j ..., which means that starting
from i the particle will hit j as many times as we desire, and note that j = i
is permitted here. 

We shall say briefly that the particle will hit any point in its range I 
infinitely often; and that the random walk is recurrent (or persistent). These 
notions will be extended to Markov chains in §8.4. 

In terms of gambling, Theorem 2 has the following implication. If the 
game is fair, then Peter is almost sure to win any amount set in advance as 
his goal, provided he can afford to go into debt for an arbitrarily large 
amount. For Theorem 2 only guarantees that he will eventually win say 
$1000000 without any assurance as to how much he may have lost before 
he gains this goal. Not a very useful piece of information this—but strictly 
fair from the point of view of Paul! More realistic prediction is given in 
(8.1.10), which may be rewritten as 


(8.2.2)   \[ u_a = \frac{b}{a + b}, \qquad v_a = \frac{a}{a + b}, \]


which says that the chance of Peter winning his goal b before he loses his entire 
capital a is in the exact inverse proportion of a to b. Thus if he has $100, 
his chance of winning $1000000 is equal to 100/1000100 or about one in ten 
thousand. This is about the state of affairs when he plays in a casino, even 
if the house does not reserve an advantage over him. 
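The arithmetic in this example is easily reproduced. A minimal sketch, where `win_probability` is merely our name for \(v_a\) in (8.2.2):

```python
from fractions import Fraction

def win_probability(a, b):
    """v_a in (8.2.2): fair-game chance of winning b before losing a."""
    return Fraction(a, a + b)

chance = win_probability(100, 1_000_000)
print(chance)          # 1/10001
print(float(chance))
```

The exact value 1/10001 is what the text rounds to "about one in ten thousand."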

Another wrinkle is added when we let \(c \to +\infty\) in the definition of \(e_j\).
The limit then represents the expected time that the particle starting at j (≥ 1)
will first reach 0 (without any constraint as to how far it can go to the right 
of j). Now this limit is infinite according to (8.1.14). This means, even if 
Peter has exactly $1 and is playing against an infinitely rich casino, he can 
“expect’’ to play a long, long time provided the game is fair. This assertion 
sounds fantastic as stated in terms of a single gambler, whereas the notion 
of mathematical expectation takes on practical meaning only through the 
law of large numbers applied to ‘“‘ensembles.” It is common knowledge that 
on any given day many small gamblers walk away from the casino with 
pocketed gains—they have happily escaped ruin because the casino did not 
have sufficient time to ruin them in spite of its substantial profit margin! 

Let us mention another method to derive (8.2.2) which is stunning. In the 
case p = q we have \(E(\xi_n) = 0\) for every n, and consequently we have from
(8.1.2) that


(8.2.3)   \[ E(X_n) = E(X_0) + E(\xi_1) + \cdots + E(\xi_n) = a. \]


In terms of the gamblers this means that Peter’s expected capital remains 
constant throughout the play since the game is fair. Now consider the duration
of play \(S_a\) in (8.1.12). It is a random variable which takes positive integer
values. Since (8.2.3) is true for every such value, might it not remain so when
we substitute \(S_a\) for n there? This is in general risky business but it happens
to be valid here by the special nature of \(S_a\) as well as that of the process \(\{X_n\}\).
We cannot justify it here (see Appendix 3) but will draw the conclusion.
Clearly \(X_{S_a}\) takes only the two values 0 and c by its definition; let

(8.2.4)   \[ P(X_{S_a} = 0) = \rho, \qquad P(X_{S_a} = c) = 1 - \rho. \]

Then

          \[ E(X_{S_a}) = \rho \cdot 0 + (1 - \rho) \cdot c = (1 - \rho)\,c. \]

Hence \(E(X_{S_a}) = a\) means

          \[ \rho = 1 - \frac{a}{c} = \frac{b}{a + b}, \]
in agreement with (8.2.2). Briefly stated, the argument above says that the 
game remains fair up to and including the time of its termination. Is this 
intuitively obvious? 
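Those who distrust the substitution of a random time for n may at least test its consequences by experiment. The following is a rough Monte Carlo sketch, not a proof: it plays the fair game many times and reports the observed frequency of ruin, the average final capital, and the average duration. The function names, the seed and the number of trials are arbitrary choices of ours.

```python
import random

def play_fair_game(a, c, rng):
    """Play until the capital hits 0 or c; return (final capital, steps)."""
    x, steps = a, 0
    while 0 < x < c:
        x += 1 if rng.random() < 0.5 else -1
        steps += 1
    return x, steps

def simulate(a, c, trials, seed=0):
    """Return (ruin frequency, mean final capital, mean duration)."""
    rng = random.Random(seed)
    ruined = total_final = total_steps = 0
    for _ in range(trials):
        x, steps = play_fair_game(a, c, rng)
        ruined += (x == 0)
        total_final += x
        total_steps += steps
    return ruined / trials, total_final / trials, total_steps / trials

rho_hat, mean_final, mean_steps = simulate(a=3, c=10, trials=20000)
print(rho_hat, mean_final, mean_steps)
```

With a = 3 and c = 10 the three numbers should come out near b/(a + b) = 0.7, a = 3 and, by (8.1.14), a(c - a) = 21, up to sampling error.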

We now proceed to describe a limiting procedure which will lead from 
the symmetric random walk to Brownian motion. The English botanist Brown 
observed (1826) that microscopic particles suspended in a liquid are subject 
to continual molecular impacts and execute zigzag movements. Einstein and 
Smoluchowski found that in spite of their apparent irregularity these movements
can be analyzed by laws of probability, in fact the displacement over 
a period of time follows a normal distribution. Einstein’s result (1906) 
amounted to a derivation of the central limit theorem (see §7.4) by the 
method of differential equations. The study of Brownian motion as a stochastic
process was undertaken by Wiener† in 1923, preceded by Bachelier’s 
heuristic work, and soon was developed into its modern edifice by Paul Lévy 
and his followers. Together with the Poisson process (§7.2) it constitutes one 
of the two fundamental species of stochastic processes, in both theory and 
application. Although the mathematical equipment allowed in this book is 
not adequate to treat the subject properly, it is possible to give an idea how 
the Brownian motion process can be arrived at through random walk and 
to describe some of its basic properties. 

The particle in motion observed by Brown moved of course in three 
dimensional space, but we can think of its projection on a coordinate axis. 
Since numerous impacts are received per second, we will shorten the unit of 
time; but we must also shorten the unit of length in such a way as to lead to 
the correct model. Let δ be the new time-unit, in other words the time between
two successive impacts. Thus in our previous language t/δ steps are
taken by the particle in old time t. Each step is still a symmetrical Bernoullian
random variable but we now suppose that the step is of magnitude \(\sqrt{\delta}\),
namely for all k:

          \[ P(\xi_k = \sqrt{\delta}) = P(\xi_k = -\sqrt{\delta}) = \frac{1}{2}. \]

We have then

          \[ E(\xi_k) = 0, \qquad \sigma^2(\xi_k) = \frac{1}{2}(\sqrt{\delta})^2 + \frac{1}{2}(-\sqrt{\delta})^2 = \delta. \]

Let \(X_0 = 0\) so that by (8.1.2)

(8.2.5)   \[ X_t = \sum_{k=1}^{t/\delta} \xi_k. \]

† Norbert Wiener (1894-1964), renowned U.S. mathematician, father of cybernetics.


If δ is much smaller than t, t/δ is large and may be thought of as an integer.
Hence we have by Theorem 4 of §6.3:


(8.2.6)   \[ E(X_t) = 0, \qquad \sigma^2(X_t) = \frac{t}{\delta} \cdot \delta = t. \]


Furthermore if t is fixed and δ → 0, then by the DeMoivre-Laplace central
limit theorem (Theorem 6 of §7.3), \(X_t\) will have the normal distribution
N(0, t). This means we are letting our approximate scheme, in which the
particle moves a distance of \(\pm\sqrt{\delta}\) with equal probability in old time δ, go
to the limit as δ → 0. This limiting scheme is the Brownian motion, also 
called Wiener process, and here is its formal definition. 


Definition of Brownian Motion. A family of random variables {X(t)}, indexed
by the continuous variable t ranging over [0, ∞) is called the Brownian
Motion iff it satisfies the following conditions:


(i) X(0) = 0;
(ii) the increments \(X(s_i + t_i) - X(s_i)\), over an arbitrary finite set of disjoint
intervals \((s_i, s_i + t_i)\), are independent random variables;
(iii) for each s ≥ 0, t ≥ 0, X(s + t) - X(s) has the normal distribution
N(0, t). 


For each constant a, the process {X(t) + a}, where X(t) is just defined, is 
called the Brownian motion starting at a. 

We have seen that the process constructed above by a limiting passage 
from symmetric random walks has the property (iii). Property (ii) comes 
from the fact that increments over disjoint intervals are obtained by summing 
the displacements \(\xi_k\) in disjoint blocks; hence the sums are independent by 
a remark made after Proposition 6 of §5.5. 
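The scaling of time by δ and of space by \(\sqrt{\delta}\) may be tried out on a computer. The sketch below, under choices of our own (t = 2, δ = 0.01, 2000 samples, a fixed seed), draws many copies of the approximating sum (8.2.5) and estimates its mean and variance, which by (8.2.6) should be near 0 and t. It says nothing about the paths themselves, only about the position at one fixed time.

```python
import math
import random

def scaled_walk_endpoint(t, delta, rng):
    """One sample of X_t in (8.2.5): t/delta steps of size +-sqrt(delta)."""
    steps = round(t / delta)        # t/delta "thought of as an integer"
    size = math.sqrt(delta)
    return sum(size if rng.random() < 0.5 else -size for _ in range(steps))

def sample_moments(t, delta, samples, seed=0):
    """Empirical mean and variance of X_t over independent samples."""
    rng = random.Random(seed)
    xs = [scaled_walk_endpoint(t, delta, rng) for _ in range(samples)]
    mean = sum(xs) / samples
    var = sum((x - mean) ** 2 for x in xs) / samples
    return mean, var

mean, var = sample_moments(t=2.0, delta=0.01, samples=2000)
print(mean, var)
```

Halving δ doubles the number of steps but leaves the variance at t, which is the whole point of choosing the step size \(\sqrt{\delta}\).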

The definition above should be compared with that of a Poisson process 
given in §7.2, the only difference being in (iii). However, by the manner in 
which a Poisson process is constructed there, we know the general appear- 
ance of its paths as described under Figure 29. The situation is far from 
obvious for Brownian motion. It is one of Wiener’s major discoveries that 
almost all its paths are continuous; namely, for almost all ω, the function
\(t \to X(t, \omega)\) is a continuous function of t in [0, ∞). In practice, we can discard
the null set of ω’s which yield discontinuous functions from the sample space
Ω, and simply stipulate that all Brownian paths are continuous. This is a 
tremendously useful property which may well be added to the definition 
above. On the other hand, Wiener also proved that almost every path is 
nowhere differentiable, i.e. the curve does not have a tangent anywhere,
which only goes to show that one cannot rely on intuition any more in these 
matters. 


Fig. 32. Brownian movement. Observations made 
at equal time intervals. The real path is even more 
complicated. 


However, it is not hard to guess the answers to our previous questions 
restated for Brownian motion. In fact, the analogue of Theorem 2 holds:
starting at any point, the path will go through any other point infinitely
many times. Note that because of the continuity of the path this will follow
from the “intermediate value theorem” in calculus once we show that it will
reach out as far as we wish. Since each approximating random walk has this
property, it is obvious that the Brownian motion does too. Finally, let us
show that the formula (8.2.2) holds also for Brownian motion, where \(u_a\) and
\(v_a\) retain the same meanings as before but now a and c are arbitrary numbers
such that 0 < a < c. Consider the Brownian motion starting at a; then it
follows from property (i) that \(E(X_t) = a\) for all t ≥ 0, which is just the
continuous analogue of (8.2.3). Now we substitute \(T_a\) for t to get \(E(X_{T_a}) = a\)
as before. This time, the continuity of paths assures us that at the instant \(T_a\)
the position of the particle must be exactly at 0 or at c. In fact, the word
“reach” used in the definition of \(u_a\), \(v_a\) and \(T_a\) would have to be explained 
more carefully if the path could jump over the boundary. Thus we can again 
write (8.2.4) and get the same answer as for the symmetric random walk. 


8.3. Transition probabilities 


The model of random walks can be greatly generalized to that of Markov 
chains, named after A. A. Markov (see §7.6). As the saying goes, one may 
fail to see the forest on account of the trees. By doing away with some cum- 
bersome and incidental features of special cases, a general theory emerges 
which is clearer and simpler and covers a wider range of applications. The 
remainder of this chapter is devoted to the elements of such a theory. 

We continue to use the language of a moving particle as in the random
walk scheme, and denote its range by I. This may now be a finite or infinite
set of integers, and it will soon be apparent that in general no geometric or
algebraic structure (such as right or left, addition and subtraction) is required
of I. Thus it may be an arbitrary countable set of elements, provided that
we extend our definition of random variables to take values in such a set.
[In §4.2 we have defined a random variable to be numerically valued.] We
shall call I the state space and an element of it a state. For example, in physical 
chemistry a state may be a certain level of energy for an atom; in public 
opinion polls it may be one of the voter’s possible states of mind, etc. The 
particle moves from state to state and the probability law governing its 
change of states or transition will be prescribed, as follows. There is a set of
transition probabilities \(p_{ij}\), where i ∈ I, j ∈ I, such that: if the particle is in
the state i at any time, regardless of what state it has been in before then, the
probability that it will be in the state j after one step is given by \(p_{ij}\). In symbols,
if \(X_n\) denotes the state of the particle at time n, then we have


(8.3.1)   \[ P\{X_{n+1} = j \mid X_n = i;\ A\} = P\{X_{n+1} = j \mid X_n = i\} = p_{ij}, \]


for an arbitrary event A determined by \(\{X_0, \ldots, X_{n-1}\}\) alone. For instance
A may be a completely specified “past” of the form “\(X_0 = i_0, X_1 = i_1, \ldots,
X_{n-1} = i_{n-1}\),” or a more general past event where the states \(i_0, \ldots, i_{n-1}\) are
replaced by sets of states: “\(X_0 \in J_0, X_1 \in J_1, \ldots, X_{n-1} \in J_{n-1}\).” In the
latter case some of these sets may be taken to be the whole space I, so that
the corresponding random variables are in effect omitted from the conditioning;
thus “\(X_0 \in J_0, X_1 \in I, X_2 \in J_2\)” is really just “\(X_0 \in J_0, X_2 \in J_2\).” 
The first equation in (8.3.1) renders the precise meaning of the phrase 
“regardless of prior history,” and is known as the Markov property. The 
second equation says that the conditional probability there does not depend 
on the value of n; this is referred to as the stationarity (or temporal homo- 
geneity) of the transition probabilities. Together they yield the following 
definition. 

Definition of Markov chain. A stochastic process \(\{X_n, n \in N^0\}\)† taking
values in a countable set I is called a homogeneous Markov chain, or Markov
chain with stationary transition probabilities, iff (8.3.1) holds.

† It may be more convenient in some verbal descriptions to begin with n = 1 rather than
n = 0. 

If the first equation in (8.3.1) holds without the second, then the Markov 
chain is referred to as being “non-homogeneous,” in which case the probability
there depends also on n and must be denoted by \(p_{ij}(n)\), say. Since we
shall treat only a homogeneous chain we mean this case when we say “Markov
chain” or “chain” below without qualification. 

As a consequence of the definition, we can write down the probabilities 
of successive transitions. Whenever the particle is in the state \(i_0\), and regardless
of its prior history, the conditional probability that it will be in the states
\(i_1, i_2, \ldots, i_n\), in the order given, during the next n steps may be suggestively
denoted by the left member below and evaluated by the right member:


(8.3.2)   \[ P\{\cdot\cdot\cdot\cdot\cdot\ i_0;\ i_1, i_2, \ldots, i_n\} = p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_{n-1} i_n}, \]


where the five dots at the beginning serve to indicate the irrelevant and for- 
gotten past. This follows by using (8.3.1) in the general formula (5.2.2) for 
joint probabilities; for instance, 


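          \[ \begin{aligned}
          P\{X_4 = j, X_5 = k, X_6 = l \mid X_3 = i\}
            &= P\{X_4 = j \mid X_3 = i\} \cdot P\{X_5 = k \mid X_3 = i, X_4 = j\} \cdot P\{X_6 = l \mid X_3 = i, X_4 = j, X_5 = k\} \\
            &= P\{X_4 = j \mid X_3 = i\}\, P\{X_5 = k \mid X_4 = j\}\, P\{X_6 = l \mid X_5 = k\} \\
            &= p_{ij}\, p_{jk}\, p_{kl}.
          \end{aligned} \]


Moreover, we may adjoin any event A determined by \(\{X_0, X_1, X_2\}\) alone 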
behind the bars in the first two members above without affecting the result. 
This kind of calculation shows that: given the state of the particle at any 
time, its prior history is not only irrelevant to the next transition as postu- 
lated in (8.3.1), but equally so to any future transitions. Symbolically, for 
any event B determined by \(\{X_{n+1}, X_{n+2}, \ldots\}\), we have


(8.3.3)   \[ P\{B \mid X_n = i;\ A\} = P\{B \mid X_n = i\} \]


as an extension of the Markov property. But that is not yet the whole story; 
there is a further and more sophisticated extension revolving around the 
three little words “‘at any time”’ italicized above, which will be needed and 
explained later. 

It is clear from (8.3.2) that all probabilities concerning the chain are 
determined by the transition probabilities, provided that it starts from a 
fixed state, e.g., \(X_0 = i\). More generally we may randomize the initial state
by putting


          \[ P\{X_0 = i\} = p_i, \quad i \in I. \]


Then \(\{p_i, i \in I\}\) is called the initial distribution of the chain and we have for
arbitrary states \(i_0, i_1, \ldots, i_n\):

(8.3.4)   \[ P\{X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n\} = p_{i_0}\, p_{i_0 i_1} \cdots p_{i_{n-1} i_n} \]


as the joint distribution of random variables of the process. Let us pause to 
take note of the special case where for every i ∈ I and j ∈ I we have


          \[ p_{ij} = p_j. \]


The right member of (8.3.4) then reduces to \(p_{i_0} p_{i_1} \cdots p_{i_n}\), and we see that the
random variables \(X_0, X_1, \ldots, X_n\) are independent with the common distribution
given by \(\{p_j, j \in I\}\). Thus, a sequence of independent, identically distributed
and countably valued random variables is a special case of Markov
chain, which has a much wider scope. The basic concept of such a scheme is 
due to Markov who introduced it around 1907. 
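Formula (8.3.4) is a plain product and translates directly into a few lines of code. The sketch below stores the initial distribution and the transition probabilities in dictionaries, so that the states need not be numbers at all; the two-state chain with states "a" and "b" is invented purely for illustration.

```python
def path_probability(initial, transition, path):
    """(8.3.4): P{X_0 = i_0, ..., X_n = i_n} for the given list of states."""
    prob = initial.get(path[0], 0.0)
    for i, j in zip(path, path[1:]):
        # One factor p_{ij} for each successive pair of states.
        prob *= transition.get(i, {}).get(j, 0.0)
    return prob

# Hypothetical two-state chain, for illustration only.
initial = {"a": 0.25, "b": 0.75}
transition = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.25, "b": 0.75},
}
print(path_probability(initial, transition, ["a", "b", "b", "a"]))  # 0.0234375
```

Here the product is 0.25 · 0.5 · 0.75 · 0.25. If every row of `transition` were equal to `initial`, the same function would return the product of the individual probabilities, which is the independent case just noted.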

It is clear from the definition of \(p_{ij}\) that we have:


(8.3.5)   (a) \(p_{ij} \ge 0\) for every i and j;
          (b) \(\sum_{j \in I} p_{ij} = 1\) for every i.


Indeed it can be shown that these are the only conditions that must be 
satisfied by the \(p_{ij}\)’s in order that they be the transition probabilities of a 
homogeneous Markov chain. In other words, such a chain can be con- 
structed to have a given matrix satisfying those conditions as its transition 
matrix. Examples are collected at the end of the section. 

Let us denote by \(p_{ij}^{(n)}\) the probability of transition from i to j in exactly
n steps, namely:

(8.3.6)   \[ p_{ij}^{(n)} = P\{X_n = j \mid X_0 = i\}. \]


Thus \(p_{ij}^{(1)}\) is our previous \(p_{ij}\) and we may add


          \[ p_{ij}^{(0)} = \delta_{ij} = \begin{cases} 0 & \text{if } i \ne j, \\ 1 & \text{if } i = j, \end{cases} \]


for convenience. The \(\delta_{ij}\) above is known as Kronecker’s symbol which you
may have seen in linear algebra. We proceed to show that for n ≥ 1, i ∈ I,
k ∈ I, we have


(8.3.7)   \[ p_{ik}^{(n)} = \sum_j p_{ij}\, p_{jk}^{(n-1)} = \sum_j p_{ij}^{(n-1)}\, p_{jk}, \]


where the sum is over I, an abbreviation which will be frequently used below. 

To argue this let the particle start from i, and consider the outcome after
taking one step. It will then be in the state j with probability \(p_{ij}\); and conditioned
on this hypothesis, it will go to the state k in n - 1 more steps with
probability \(p_{jk}^{(n-1)}\), regardless of what i is. Hence the first equation in (8.3.7)
is obtained by summing over all j according to the general formula for total
probabilities; see (5.2.3) or (5.2.4). The second equation in (8.3.7) is proved 
in a similar way by considering first the transition in n — 1 steps, followed 
by one more step. 

For n = 2, (8.3.7) becomes 


(8.3.8)   \[ p_{ik}^{(2)} = \sum_j p_{ij}\, p_{jk}, \]


which suggests the use of matrices. Let us arrange the \(p_{ij}\)’s in the form of
a matrix


(8.3.9)   \[ \Pi = [p_{ij}], \]


so that \(p_{ij}\) is the element at the ith row and jth column. Recall that the elements
of Π satisfy the conditions in (8.3.5). Such a matrix is called stochastic.
Now the product of two square matrices \(\Pi_1 \times \Pi_2\) is another such matrix
whose element at the ith row and jth column is obtained by multiplying the
corresponding elements of the ith row of \(\Pi_1\) with those of the jth column of
\(\Pi_2\), and then adding all such products. In case both \(\Pi_1\) and \(\Pi_2\) are the same
Π, this yields precisely the right member of (8.3.8). Therefore we have


          \[ \Pi^2 = \Pi \times \Pi = [p_{ij}^{(2)}], \]

and it follows by induction on n and (8.3.7) that

          \[ \Pi^n = \Pi \times \Pi^{n-1} = \Pi^{n-1} \times \Pi = [p_{ij}^{(n)}]. \]


In other words, the n-step transition probabilities \(p_{ij}^{(n)}\) are just the elements in
the nth power of Π. If I is the finite set {1, 2, ..., r}, then the rule of multiplication
described above is of course the same as the usual one for square
matrices (or determinants) of order r. When I is an infinite set the same
rule applies but we must make sure that the resulting infinite series such
as the one in (8.3.8) are all convergent. This is indeed so, by virtue of
(8.3.7). We can now extend the latter as follows. For \(n \in N^0\), \(m \in N^0\) and
i ∈ I, k ∈ I, we have


(8.3.10)  \[ p_{ik}^{(n+m)} = \sum_j p_{ij}^{(n)}\, p_{jk}^{(m)}. \]


This set of equations is known as the Chapman-Kolmogorov equations. [Sydney 
Chapman, 1888-1970, English applied mathematician.] It is simply an expression
of the law of exponentiation for powers of Π:


          \[ \Pi^{n+m} = \Pi^n \times \Pi^m, \]


and can be proved, either by induction on m from (8.3.7), purely algebraically,
or by a probabilistic argument along the same line as that for (8.3.7). Finally,
let us record the trivial equation, valid for each \(n \in N^0\) and i ∈ I:

(8.3.11)  \[ \sum_j p_{ij}^{(n)} = 1. \]
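For a finite state space the relations (8.3.7), (8.3.10) and (8.3.11) can be verified mechanically by multiplying matrices. Here is a minimal sketch in exact arithmetic; `mat_mul` and `mat_pow` are our own helper names, and the 3-by-3 stochastic matrix is merely a convenient specimen.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Row-by-column product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    """n-th power of P; the 0-th power is the Kronecker delta matrix."""
    size = len(P)
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result

half = Fraction(1, 2)
# A specimen stochastic matrix on three states (an assumption for illustration).
P = [[Fraction(0), Fraction(1), Fraction(0)],
     [half,        Fraction(0), half],
     [Fraction(0), Fraction(1), Fraction(0)]]

# Chapman-Kolmogorov (8.3.10) with n = 2, m = 3:
print(mat_pow(P, 5) == mat_mul(mat_pow(P, 2), mat_pow(P, 3)))   # True
# Every row of every power sums to one, as in (8.3.11):
print(all(sum(row) == 1 for row in mat_pow(P, 5)))              # True
```

Repeated multiplication is used instead of a faster squaring scheme so that the code mirrors the induction on n in the text.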


The matrix \(\Pi^n\) may be called the n-step transition matrix. Using \(p_{ij}^{(n)}\) we can 
express joint probabilities when some intermediate states are not specified. 
An example will make this clear: 


          \[ P\{X_4 = j, X_6 = k, X_9 = l \mid X_2 = i\} = p_{ij}^{(2)}\, p_{jk}^{(2)}\, p_{kl}^{(3)}. \]


We are now going to give some illustrative examples of homogeneous 
Markov chains, and one which is non-homogeneous. 
Example 1. I = {..., -2, -1, 0, 1, 2, ...} is the set of all integers.

(8.3.12)  \[ p_{ij} = \begin{cases} p & \text{if } j = i + 1, \\ q & \text{if } j = i - 1, \\ 0 & \text{otherwise;} \end{cases} \]


where p + q = 1, p ≥ 0, q ≥ 0. This is the free random walk discussed in
§8.1. In the extreme cases p = 0 or q = 0, it is of course deterministic 
(almost surely). 


Example 2. I = {0, 1, 2, ...} is the set of nonnegative integers; \(p_{ij}\) is the
same as in Example 1 for i ≠ 0, but \(p_{00} = 1\) which entails \(p_{0j} = 0\) for all
j ≠ 0. This is the random walk with one absorbing state 0. It is the model
appropriate for Problem 3 in §8.2. The absorbing state corresponds to the
ruin (state of bankruptcy) of Peter, whereas Paul is infinitely rich so that I is
unlimited to the right. 


Example 3. I = {0, 1, ..., c}, c ≥ 2.


For 1 ≤ i ≤ c - 1, the \(p_{ij}\)’s are the same as in Example 1, but

(8.3.13)  \[ p_{00} = 1, \qquad p_{cc} = 1. \]


This is the random walk with two absorbing barriers 0 and c, and is appropriate
for Problem 1 of §8.1. Π is a square matrix of order c + 1. 
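It is instructive to connect this matrix with the ruin probabilities of §8.1: as n grows, the entry \(p_{j0}^{(n)}\) of \(\Pi^n\) is the probability of having been absorbed at 0 within n steps, and it should approach \(u_j\). A minimal floating-point sketch follows; the helper names are ours, and a single row of \(\Pi^n\) is obtained by repeated vector-matrix products.

```python
def absorbing_walk_matrix(c, p):
    """Transition matrix of Example 3 on {0, 1, ..., c}, as nested lists."""
    q = 1.0 - p
    P = [[0.0] * (c + 1) for _ in range(c + 1)]
    P[0][0] = 1.0                   # (8.3.13): both endpoints absorb
    P[c][c] = 1.0
    for i in range(1, c):
        P[i][i + 1] = p
        P[i][i - 1] = q
    return P

def n_step_row(P, start, n):
    """Row `start` of the n-th power of P."""
    size = len(P)
    row = [1.0 if k == start else 0.0 for k in range(size)]
    for _ in range(n):
        row = [sum(row[j] * P[j][k] for j in range(size)) for k in range(size)]
    return row

row = n_step_row(absorbing_walk_matrix(4, 0.5), start=1, n=200)
print(row[0], row[4])   # close to u_1 = 3/4 and v_1 = 1/4
```

After 200 steps practically all the probability sits on the two absorbing states, in the proportions given by (8.1.10).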


Example 4. In Example 3 replace (8.3.13) by


          \[ p_{01} = 1, \qquad p_{c,c-1} = 1. \]

          \[ \Pi = \begin{pmatrix} 0 & 1 & 0 & 0 & \cdots \\ q & 0 & p & 0 & \cdots \\ \cdots & & & & \end{pmatrix} \]


This represents a random walk with two reflecting barriers such that after the
particle reaches either endpoint of the interval [0, c], it is bound to turn back
at the next step. In other words, either gambler will be given a $1 reprieve
whenever he becomes bankrupt, so that the game can go on forever, for
fun! We may also eliminate the two states 0 and c, and let I = {1, 2, ...,
c - 1},


          \[ p_{11} = q, \quad p_{12} = p, \quad p_{c-1,c-2} = q, \quad p_{c-1,c-1} = p. \]

          \[ \Pi = \begin{pmatrix} q & p & 0 & 0 & \cdots \\ \cdots & & & & \end{pmatrix} \]


Example 5. Let p ≥ 0, q ≥ 0, r ≥ 0 and p + q + r = 1. In Examples 1 to 4
replace each row of the form (... q 0 p ...) by (... q r p ...). This means that 
at each step the particle may stay put, or that the game may be a draw, with 
probability r. When r = 0 this reduces to the preceding examples. 


Example 6. Let \(\{\xi_n, n \ge 0\}\) be a sequence of independent integer-valued
random variables such that all except possibly \(\xi_0\) have the same distribution
given by \(\{a_k, k \in I\}\), where I is the set of all integers. Define \(X_n\) as in (8.1.2):
\(X_n = \sum_{k=0}^{n} \xi_k\), n ≥ 0. Since \(X_{n+1} = X_n + \xi_{n+1}\), and \(\xi_{n+1}\) is independent of
\(X_0, X_1, \ldots, X_n\), we have for any event A determined by \(X_0, \ldots, X_{n-1}\) alone:

          \[ \begin{aligned}
          P\{X_{n+1} = j \mid A;\ X_n = i\} &= P\{\xi_{n+1} = j - i \mid A;\ X_n = i\} \\
                                            &= P\{\xi_{n+1} = j - i\} = a_{j-i}.
          \end{aligned} \]


Hence \(\{X_n, n \ge 0\}\) constitutes a homogeneous Markov chain with the
transition matrix \([p_{ij}]\), where

(8.3.14)  \[ p_{ij} = a_{j-i}. \]


The initial distribution is the distribution of \(\xi_0\) which need not be the same
as \(\{a_k\}\). Such a chain is said to be spatially homogeneous as \(p_{ij}\) depends only
on the difference j - i. Conversely, suppose \(\{X_n, n \ge 0\}\) is a chain with the
transition matrix given in (8.3.14), then we have


          \[ P\{X_{n+1} - X_n = k \mid X_n = i\} = P\{X_{n+1} = i + k \mid X_n = i\} = p_{i,i+k} = a_k. \]


It follows that if we put \(\xi_{n+1} = X_{n+1} - X_n\), then the random variables
\(\{\xi_n, n \ge 1\}\) are independent (why?) and have the common distribution \(\{a_k\}\). 
Thus a spatially as well as temporally homogeneous Markov chain is identical 
with the successive partial sums of independent and identically distributed 
integer-valued random variables. The study of the latter has been one of our 
main concerns in previous chapters. 

In particular, Example 1 is the particular case of Example 6 with \(a_1 = p\),
\(a_{-1} = q\); we may add \(a_0 = r\) as in Example 5. 


Example 7. For each i ∈ I let \(p_i\) and \(q_i\) be two nonnegative numbers satisfying
\(p_i + q_i = 1\). Take I to be the set of all integers and put


(8.3.15)  \[ p_{ij} = \begin{cases} p_i & \text{if } j = i + 1, \\ q_i & \text{if } j = i - 1, \\ 0 & \text{otherwise.} \end{cases} \]


In this model the particle can move only to neighboring states as in Example
1, but the probabilities may now vary with the position. The model can be
generalized as in Example 5 by allowing also the particle to stay put at each
position i with probability \(r_i\), with \(p_i + q_i + r_i = 1\). Observe that this
example contains also Examples 2, 3 and 4 above. The resulting chain is no
longer representable as sums of independent steps as in Example 6. For a
full discussion of the example see [Chung 2]. 


Example 8. (Ehrenfest model). This may be regarded as a particular case of
Example 7 in which we have $I = \{0, 1, \ldots, c\}$ and

$$p_{i,i+1} = \frac{c - i}{c}, \qquad p_{i,i-1} = \frac{i}{c}. \tag{8.3.16}$$

It can be realized by an urn scheme as follows. An urn contains $c$ balls, each
of which may be red or black; a ball is drawn at random from it and replaced
by one of the other color. The state of the urn is the number of black balls in
it. It is easy to see that the transition probabilities are as given above and the
interchange can go on forever. P. and T. Ehrenfest used the model to study
the transfer of heat between gas molecules. Their original urn scheme is
slightly more complicated (see Exercise 14).
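
The transition probabilities of the Ehrenfest model are easy to tabulate by machine. The following Python sketch is an illustration only: it takes (8.3.16) to read $p_{i,i+1} = (c-i)/c$ and $p_{i,i-1} = i/c$, which is what the urn description gives, and the function name `ehrenfest_matrix` is our own choice, not notation from the text.

```python
from fractions import Fraction


def ehrenfest_matrix(c):
    """Transition matrix of the Ehrenfest urn on states 0..c, as exact fractions."""
    P = [[Fraction(0)] * (c + 1) for _ in range(c + 1)]
    for i in range(c + 1):
        if i < c:
            # one of the c - i red balls is drawn and replaced by a black one
            P[i][i + 1] = Fraction(c - i, c)
        if i > 0:
            # one of the i black balls is drawn and replaced by a red one
            P[i][i - 1] = Fraction(i, c)
    return P


P = ehrenfest_matrix(4)
for row in P:
    print([str(x) for x in row])
```

Each row sums to 1, and for $c = 4$ the row for state 1 is $(1/4,\ 0,\ 3/4,\ 0,\ 0)$.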
Example 9. Let $I = \{0, 1, 2, \ldots\}$ and

$$p_{i,0} = p_i, \qquad p_{i,i+1} = 1 - p_i \quad \text{for } i \in I.$$

The $p_i$'s are arbitrary numbers satisfying $0 < p_i < 1$. This model is used to
study a recurring phenomenon which is represented by the state 0. Each
transition may signal an occurrence of the phenomenon, or else prolong the
waiting time by one time unit. It is easy to see that the event "$X_n = k$"
means that the last time $\le n$ when the phenomenon occurred is at time $n - k$,
where $0 \le k \le n$; in other words there has been a waiting period equal to $k$
units since that occurrence. In the particular case where all $p_i$ are equal to $p$
we have

$$P\{X_v \ne 0 \text{ for } 1 \le v \le n - 1;\ X_n = 0 \mid X_0 = 0\} = (1 - p)^{n-1} p.$$

This gives the geometric waiting time discussed in Example 8 of §4.4.


Example 10. Let $I$ be the integer lattice in $R^d$, the Euclidean space of $d$
dimensions. This is a countable set. We assume that: starting at any lattice
point, the particle can go only to one of the $2d$ neighboring points in one
step, with various (not necessarily equal) probabilities. For $d = 1$ this
is just Example 1, for $d = 2$ this is the street wanderer mentioned in §8.1.
In the latter case we may represent the states by $(i, i')$ where $i$ and $i'$ are integers;
then we have

$$p_{(i,i')(j,j')} = \begin{cases} p_1 & \text{if } j = i + 1,\ j' = i'; \\ p_2 & \text{if } j = i - 1,\ j' = i'; \\ p_3 & \text{if } j = i,\ j' = i' + 1; \\ p_4 & \text{if } j = i,\ j' = i' - 1; \end{cases}$$

where $p_1 + p_2 + p_3 + p_4 = 1$. If all these four probabilities are equal to 1/4
the chain is a symmetric two-dimensional random walk. Will the particle
still hit every lattice point with probability one? Will it do the same in three
dimensions? These questions will be answered in the next section.


Example 11. (non-homogeneous Markov chain). Consider the Pólya urn
scheme described in §5.3 with $c \ge 1$. The number of black balls in the urn is
called its state so that "$X_n = i$" means that after $n$ drawings and insertions
there are $i$ black balls in the urn. Clearly each transition either increases this
number by $c$ or leaves it unchanged, and we have

$$P\{X_{n+1} = j \mid X_n = i; A\} = \begin{cases} \dfrac{i}{b + r + nc} & \text{if } j = i + c, \\[2ex] 1 - \dfrac{i}{b + r + nc} & \text{if } j = i, \\[2ex] 0 & \text{otherwise;} \end{cases} \tag{8.3.17}$$

where $A$ is any event determined by the outcomes of the first $n - 1$ drawings.
The probability above depends on $n$ as well as $i$ and $j$, hence the process
is a non-homogeneous Markov chain. We may also allow $c = -1$, which
is the case of sampling without replacement and yields a finite sequence of
$\{X_n;\ 0 \le n \le b + r\}$.


Example 12. It is trivial to define a process which is not Markovian. For
instance in Example 8 or 11, let $X_n = 0$ or 1 according as the $n$th ball drawn
is red or black. Then it is clear that the probability of "$X_{n+1} = 1$" given the
values of $X_1, \ldots, X_n$ will not in general be the same as given the value of $X_n$
alone. Indeed the latter probability is not very useful.


8.4. Basic structure of Markov chains 


We begin a general study of the structure of homogeneous Markov chains by
defining a binary relation between the states. We say "$i$ leads to $j$" and write
"$i \rightsquigarrow j$" iff there exists $n \ge 1$ such that $p_{ij}^{(n)} > 0$; we say "$i$ communicates
with $j$" and write "$i \leftrightsquigarrow j$" iff we have both $i \rightsquigarrow j$ and $j \rightsquigarrow i$. The relation "$\rightsquigarrow$"
is transitive, namely if $i \rightsquigarrow j$ and $j \rightsquigarrow k$ then $i \rightsquigarrow k$. This follows from the
inequality

$$p_{ik}^{(n+m)} \ge p_{ij}^{(n)} p_{jk}^{(m)}, \tag{8.4.1}$$

which is an algebraic consequence of (8.3.10), but perhaps even more obvious
from its probabilistic meaning. For if it is possible to go from $i$ to $j$ in $n$ steps,
and also possible to go from $j$ to $k$ in $m$ steps, then it is possible by combining
these steps to go from $i$ to $k$ in $n + m$ steps. Here and henceforth we shall
use such expressions as "it is possible" or "one can" to mean with positive
probability; but observe that even in the trivial argument just given the
Markov property has been used and cannot be done without. The relation
"$\leftrightsquigarrow$" is clearly both symmetric and transitive and may be used to divide the
states into disjoint classes as follows.


Definition of Class. A class of states is a subset of the state space such that
any two states (distinct or not) in the class communicate with each other.

This kind of classification may be familiar to you under the name of
"equivalence classes." But here the relation "$\leftrightsquigarrow$" is not necessarily reflexive;
in other words there may be a state which does not lead to itself, hence it does
not communicate with any state. Such states are simply unclassified! On the
other hand, a class may consist of a single state $i$: this is the case when
$p_{ii} = 1$. Such a state is called an absorbing state. Two classes which are not
identical must be disjoint, because if they have a common element they must
merge into one class via that element.

For instance in Examples 1, 4, 5, 8 and 9 all states form a single class
provided $p > 0$ and $q > 0$; as also in Example 7 provided $p_i > 0$ and $q_i > 0$
for all $i$. In Example 2 there are two classes: the absorbing state 0 as a singleton
and all the rest as another class. Similarly in Example 3 there are three
classes. In Example 6 the situation is more complicated. Suppose for instance
the $a_k$'s are such that $a_k > 0$ if $k$ is divisible by 5, and $a_k = 0$ otherwise. Then
the state space $I$ can be decomposed into five classes. Two states $i$ and $j$
belong to the same class if and only if $i - j$ is divisible by 5. In other words,
these classes coincide with the residue classes modulo 5. It is clear that in such
a situation it would be more natural to take one of these classes as the reduced
state space, because if the particle starts from any class it will (almost
surely) remain in that class forever, so why bother dragging in those other
states it will never get to?
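
For a finite state space the division into classes can be carried out mechanically. The sketch below is one possible way, assuming the transition matrix is given as a list of rows. It computes the relation "leads to" by Warshall's transitive-closure algorithm (a standard technique that is not part of the text) and then groups the communicating states. The names `communicating_classes`, `leads`, `walk` and `shift` are hypothetical.

```python
def communicating_classes(P):
    """Return (classes, unclassified) for a finite transition matrix P."""
    n = len(P)
    # leads[i][j] becomes True iff p_ij^(m) > 0 for some m >= 1
    leads = [[P[i][j] > 0 for j in range(n)] for i in range(n)]
    for k in range(n):  # Warshall's transitive closure
        for i in range(n):
            if leads[i][k]:
                for j in range(n):
                    if leads[k][j]:
                        leads[i][j] = True
    classes, unclassified, seen = [], [], set()
    for i in range(n):
        if i in seen:
            continue
        if not leads[i][i]:
            unclassified.append(i)  # i does not lead to itself
            continue
        cls = [j for j in range(n) if leads[i][j] and leads[j][i]]
        seen.update(cls)
        classes.append(cls)
    return classes, unclassified


# Random walk on {0, 1, 2, 3} with two absorbing barriers and p = q = 1/2.
walk = [[1, 0, 0, 0],
        [0.5, 0, 0.5, 0],
        [0, 0.5, 0, 0.5],
        [0, 0, 0, 1]]
# State 0 moves to state 1 at once and never comes back.
shift = [[0, 1],
         [0, 1]]
print(communicating_classes(walk))
print(communicating_classes(shift))
```

For `walk` the classes are the two absorbing singletons and the interior states {1, 2}; for `shift` the state 0 is unclassified, since it does not lead to itself.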

In probability theory, particularly in Markov chains, the first instance of
occurrence of a sequence of events is an important notion. Let $j$ be an
arbitrary state and consider the first time that the particle enters it, namely:

$$T_j(\omega) = \min\{n \ge 1 \mid X_n(\omega) = j\}, \tag{8.4.2}$$

where the right member reads as follows: the minimum positive value of $n$
such that $X_n = j$. For some sample point $\omega$, $X_n(\omega)$ may never be $j$, so that no
value of $n$ exists in the above and $T_j$ is not really defined for that $\omega$. In such
a case we shall define it by the decree: $T_j(\omega) = \infty$. In common language, "it
will never happen" may be rendered into "one can wait until eternity (or 'hell
freezes over')." With this convention $T_j$ is a random variable which may take
the value $\infty$. Let us denote the set $\{1, 2, \ldots, \infty\}$ by $N_\infty$. Then $T_j$ takes values
in $N_\infty$; this is a slight extension of our general definition in §4.2.

We proceed to write down the probability distribution of $T_j$. For simplicity
of notation we shall write $P_i\{\cdots\}$ for probability relations associated
with a Markov chain starting from the state $i$. We then put, for $n \in N_\infty$:

$$f_{ij}^{(n)} = P_i\{T_j = n\}, \tag{8.4.3}$$

and

$$f_{ij}^{*} = \sum_{n=1}^{\infty} f_{ij}^{(n)} = P_i\{T_j < \infty\}. \tag{8.4.4}$$

Remember that $\sum_{n=1}^{\infty}$ really means $\sum_{1 \le n < \infty}$; we wish to stress the fact that
the value $\infty$ for the superscript is not included in the summation. It follows
that

$$P_i\{T_j = \infty\} = f_{ij}^{(\infty)} = 1 - f_{ij}^{*}. \tag{8.4.5}$$

Thus $\{f_{ij}^{(n)}, n \in N_\infty\}$ is the probability distribution of $T_j$ for the chain starting
from $i$.

We can give another more explicit expression for $f_{ij}^{(n)}$ etc., as follows:

$$\begin{aligned}
f_{ij}^{(1)} &= p_{ij} = P_i\{X_1 = j\}, \\
f_{ij}^{(n)} &= P_i\{X_v \ne j \text{ for } 1 \le v \le n - 1;\ X_n = j\}, \quad n \ge 2; \\
f_{ij}^{(\infty)} &= P_i\{X_v \ne j \text{ for all } v \ge 1\}, \\
f_{ij}^{*} &= P_i\{X_v = j \text{ for some } v \ge 1\}.
\end{aligned} \tag{8.4.6}$$


Note that we may have $i = j$ in the above, and "for some $v$" means "for at
least one value of $v$."

The random variable $T_j$ is called the first entrance time into the state $j$; the
terms "first passage time" and "first hitting time" are also used. It is noteworthy
that by virtue of homogeneity we have

$$f_{ij}^{(n)} = P\{X_{m+v} \ne j \text{ for } 1 \le v \le n - 1;\ X_{m+n} = j \mid X_m = i\} \tag{8.4.7}$$

for any $m$ for which the conditional probability is defined. This kind of
interpretation will be constantly used without specific mention.
The key formula connecting the $f_{ij}^{(n)}$ and $p_{ij}^{(n)}$ will now be given.


Theorem 3. For any $i$ and $j$, and $1 \le n < \infty$, we have

$$p_{ij}^{(n)} = \sum_{v=1}^{n} f_{ij}^{(v)} p_{jj}^{(n-v)}. \tag{8.4.8}$$


Proof: This result is worthy of a formal treatment in order to bring out the
basic structure of a homogeneous Markov chain. Everything can be set down
in a string of symbols:

$$\begin{aligned}
p_{ij}^{(n)} = P_i\{X_n = j\} &= P_i\{T_j \le n;\ X_n = j\} = \sum_{v=1}^{n} P_i\{T_j = v;\ X_n = j\} \\
&= \sum_{v=1}^{n} P_i\{T_j = v\}\, P_i\{X_n = j \mid T_j = v\} \\
&= \sum_{v=1}^{n} P_i\{T_j = v\}\, P_i\{X_n = j \mid X_1 \ne j, \ldots, X_{v-1} \ne j, X_v = j\} \\
&= \sum_{v=1}^{n} P_i\{T_j = v\}\, P\{X_n = j \mid X_v = j\} \\
&= \sum_{v=1}^{n} P_i\{T_j = v\}\, P_j\{X_{n-v} = j\} \\
&= \sum_{v=1}^{n} f_{ij}^{(v)} p_{jj}^{(n-v)}.
\end{aligned}$$

Let us explain each equation above. The first is the definition of $p_{ij}^{(n)}$; the
second because $\{X_n = j\}$ implies $\{T_j \le n\}$; the third because the events
$\{T_j = v\}$ for $1 \le v \le n$ are disjoint; the fourth is by definition of conditional
probability; the fifth is by the meaning of $\{T_j = v\}$ as given in (8.4.6); the
sixth by the Markov property in (8.3.1) since $\{X_1 \ne j, \ldots, X_{v-1} \ne j\}$,
as well as $\{X_0 = i\}$ implicit in the notation $P_i$, constitutes an event prior to
the time $v$; the seventh is by the temporal homogeneity of a transition from
$j$ to $j$ in $n - v$ steps; the eighth is just notation. The proof of Theorem 3 is
therefore completed.
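
The first entrance decomposition can be checked by exact arithmetic on a small chain. The following Python sketch is only an illustration under our own assumptions: the function names are hypothetical, the sample matrix is the random walk on {0, 1, 2, 3} with absorbing barriers and $p = q = 1/2$, and $f_{ij}^{(n)}$ is computed from (8.4.6) by conditioning on the first step, namely $f_{ij}^{(n)} = \sum_{k \ne j} p_{ik} f_{kj}^{(n-1)}$ for $n \ge 2$.

```python
from fractions import Fraction


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(P, N):
    """Return the list [P^0, P^1, ..., P^N] of n-step transition matrices."""
    n = len(P)
    powers = [[[Fraction(int(i == j)) for j in range(n)] for i in range(n)]]
    for _ in range(N):
        powers.append(mat_mul(powers[-1], P))
    return powers


def first_entrance(P, j, N):
    """f[m][i] = f_ij^(m) for 1 <= m <= N (index 0 is unused)."""
    n = len(P)
    f = [None, [P[i][j] for i in range(n)]]
    for m in range(2, N + 1):
        # first step to some k != j, then first entrance into j after m - 1 more steps
        f.append([sum(P[i][k] * f[m - 1][k] for k in range(n) if k != j)
                  for i in range(n)])
    return f


def check_theorem_3(P, N):
    """True iff (8.4.8) holds exactly for all i, j and 1 <= n <= N."""
    n = len(P)
    p = n_step(P, N)
    for j in range(n):
        f = first_entrance(P, j, N)
        for i in range(n):
            for m in range(1, N + 1):
                rhs = sum(f[v][i] * p[m - v][j][j] for v in range(1, m + 1))
                if p[m][i][j] != rhs:
                    return False
    return True


half = Fraction(1, 2)
P = [[Fraction(1), 0, 0, 0],
     [half, 0, half, 0],
     [0, half, 0, half],
     [0, 0, 0, Fraction(1)]]
print(check_theorem_3(P, 8))
```

Exact fractions are used so that the two members of (8.4.8) can be compared with `==` rather than within a tolerance.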

True, a quicker verbal account can be and is usually given for (8.4.8), but 
if you spell out the details and pause to ask ‘“‘why” at each stage, it will come 
essentially to a rough translation of the derivation above. This is a pattern 
of argument much used in a general context in the advanced theory of Markov 
processes, so a thorough understanding of the simplest case as this one is 
well worth the pains. 

For $i \ne j$, the formula (8.4.8) relates the transition matrix elements at $(i, j)$
to the diagonal element at $(j, j)$. There is a dual formula which relates them
to the diagonal elements at $(i, i)$. This is obtained by an argument involving
the last exit from $i$ as the dual of first entrance into $j$. It is slightly more tricky
in its conception and apparently known only to a few specialists. We will 
present it here for the sake of symmetry—and mathematical beauty. Actually 
the formula is a powerful tool in the theory of Markov chains, although it is 
not necessary for our discussions here. 


[Figure 33]
Define for $n \ge 1$:

$$U_i^{(n)}(\omega) = \max\{0 \le v \le n \mid X_v(\omega) = i\}; \tag{8.4.9}$$

namely $U_i^{(n)}$ is the last exit time from the state $i$ before or at the given time $n$.
This is the dual of $T_j$ but complicated by its dependence on $n$. Next we
introduce the counterpart of $f_{ij}^{(n)}$, as follows:

$$\begin{aligned}
g_{ij}^{(1)} &= p_{ij}, \\
g_{ij}^{(n)} &= P_i\{X_v \ne i \text{ for } 1 \le v \le n - 1;\ X_n = j\}, \quad 2 \le n < \infty.
\end{aligned} \tag{8.4.10}$$

Thus $g_{ij}^{(n)}$ is the probability of going from $i$ to $j$ in $n$ steps without going
through $i$ again (for $n = 1$ the restriction is automatically satisfied since
$i \ne j$). Contrast this with $f_{ij}^{(n)}$ which may now be restated as the probability
of going from $i$ to $j$ in $n$ steps without going through $j$ before. Both kinds of
probability impose a taboo on certain passages and are known as taboo probabilities
(see [Chung 2; §1.9] for a fuller discussion). We can now state the
following result.


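Theorem 4. For $i \ne j$, and $n \ge 1$, we have

$$p_{ij}^{(n)} = \sum_{v=0}^{n-1} p_{ii}^{(v)} g_{ij}^{(n-v)}. \tag{8.4.11}$$


Proof: We shall imitate the steps in the proof of Theorem 3 as far as possible,
thus:

$$\begin{aligned}
p_{ij}^{(n)} = P_i\{X_n = j\} &= P_i\{0 \le U_i^{(n)} \le n - 1,\ X_n = j\} = \sum_{v=0}^{n-1} P_i\{U_i^{(n)} = v;\ X_n = j\} \\
&= \sum_{v=0}^{n-1} P_i\{X_v = i,\ X_u \ne i \text{ for } v + 1 \le u \le n - 1;\ X_n = j\} \\
&= \sum_{v=0}^{n-1} P_i\{X_v = i\}\, P_i\{X_u \ne i \text{ for } 1 \le u \le n - v - 1;\ X_{n-v} = j\} \\
&= \sum_{v=0}^{n-1} p_{ii}^{(v)} g_{ij}^{(n-v)}.
\end{aligned}$$

The major difference lies in the fourth equation above, but this is obvious
from the meaning of $U_i^{(n)}$. We leave the rest to the reader.
We put also

$$g_{ij}^{*} = \sum_{n=1}^{\infty} g_{ij}^{(n)}. \tag{8.4.12}$$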


However, while each term in the series above is a probability, it is not clear 
whether the series converges (it does, provided i-~* j; see Exercise 33 below). 
In fact, $g_{ij}^{*}$ may be seen to represent the expected number of entrances in $j$
between two successive entrances in $i$.
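
The last exit decomposition (8.4.11) of Theorem 4 can be checked numerically in the same spirit as Theorem 3. The sketch below is a hedged illustration, not part of the original argument: the function names and the three-state matrix are hypothetical, and the taboo probabilities $g_{ij}^{(n)}$ are computed from (8.4.10) by conditioning on the last step, $g_{ij}^{(n)} = \sum_{l \ne i} g_{il}^{(n-1)} p_{lj}$ for $n \ge 2$.

```python
from fractions import Fraction


def n_step_from(P, i, N):
    """rows[m][k] = p_ik^(m) for 0 <= m <= N."""
    n = len(P)
    rows = [[Fraction(int(k == i)) for k in range(n)]]
    for _ in range(N):
        prev = rows[-1]
        rows.append([sum(prev[l] * P[l][k] for l in range(n)) for k in range(n)])
    return rows


def taboo_from(P, i, N):
    """g[m][k] = g_ik^(m) for 1 <= m <= N and k != i (the entry k == i is kept at 0)."""
    n = len(P)
    g = [None, [P[i][k] if k != i else Fraction(0) for k in range(n)]]
    for m in range(2, N + 1):
        prev = g[-1]
        # the state at time m - 1 is some l != i (prev[i] is 0), then one step to k != i
        g.append([sum(prev[l] * P[l][k] for l in range(n)) if k != i else Fraction(0)
                  for k in range(n)])
    return g


def check_theorem_4(P, N):
    """True iff (8.4.11) holds exactly for all i != j and 1 <= n <= N."""
    n = len(P)
    for i in range(n):
        p = n_step_from(P, i, N)
        g = taboo_from(P, i, N)
        for j in range(n):
            if j == i:
                continue
            for m in range(1, N + 1):
                if p[m][j] != sum(p[v][i] * g[m - v][j] for v in range(m)):
                    return False
    return True


P = [[Fraction(1, 2), Fraction(1, 2), Fraction(0)],
     [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
     [Fraction(0), Fraction(1, 3), Fraction(2, 3)]]
print(check_theorem_4(P, 8))
```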

Theorems 3 and 4 may be called respectively the first entrance and last exit 
decomposition formulas. Used together they work like the two hands of a 
human being, though one can do many things with one hand tied behind 
one’s back, as we shall see later. Here as a preliminary ambidextrous appli- 
cation let us state the following little proposition as a lemma. 


Lemma. $i \rightsquigarrow j$ is equivalent to $f_{ij}^{*} > 0$ and to $g_{ij}^{*} > 0$.


Proof: If $f_{ij}^{*} = 0$, then $f_{ij}^{(n)} = 0$ for every $n$ and it follows from (8.4.8) that
$p_{ij}^{(n)} = 0$ for every $n$. Hence it is false that $i \rightsquigarrow j$. Conversely if $f_{ij}^{*} > 0$, then
$f_{ij}^{(n)} > 0$ for some $n$; since $p_{ij}^{(n)} \ge f_{ij}^{(n)}$ from the meaning of these two probabilities,
we get $p_{ij}^{(n)} > 0$ and so $i \rightsquigarrow j$.
Now the argument for $g_{ij}^{*}$ is exactly the same when we use (8.4.11) in lieu
of (8.4.8), demonstrating the beauty of dual thinking.


Let us admit that the preceding proof is unduly hard in the case of $f_{ij}^{*}$,
since a little reflection should convince us that "$i \rightsquigarrow j$" and "$f_{ij}^{*} > 0$" both
mean: "it is possible to go from $i$ to $j$ in some steps" (see also Exercise 31).
However, is it equally obvious that "$g_{ij}^{*} > 0$" means the same thing? The
latter says that it is possible to go from i to j in some steps without going 
through i again. Hence the asserted equivalence will imply this: if it is possible 
to go from i to j then it is also possible to do so without first returning to i. 
For example, since one can drive from New York to San Francisco, does it 
follow that one can do that without coming back for repairs, forgotten items, 
or a temporary postponement? Is this so obvious that no proof is needed? 

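An efficient way to exploit the decomposition formulas is to introduce
generating functions associated with the sequences $\{p_{ij}^{(n)}, n \ge 0\}$ (see §6.5):

$$\begin{aligned}
P_{ij}(z) &= \sum_{n=0}^{\infty} p_{ij}^{(n)} z^n, \\
F_{ij}(z) &= \sum_{n=1}^{\infty} f_{ij}^{(n)} z^n, \\
G_{ij}(z) &= \sum_{n=1}^{\infty} g_{ij}^{(n)} z^n,
\end{aligned}$$

where $|z| < 1$. We have then by substitution from (8.4.8) and inverting
the order of summation:

$$\begin{aligned}
P_{ij}(z) &= \delta_{ij} + \sum_{n=1}^{\infty} \Bigl( \sum_{v=1}^{n} f_{ij}^{(v)} p_{jj}^{(n-v)} \Bigr) z^v z^{n-v} \\
&= \delta_{ij} + \sum_{v=1}^{\infty} f_{ij}^{(v)} z^v \sum_{n=v}^{\infty} p_{jj}^{(n-v)} z^{n-v} \\
&= \delta_{ij} + F_{ij}(z) P_{jj}(z).
\end{aligned} \tag{8.4.13}$$

The inversion is justified because both series are absolutely convergent for
$|z| < 1$. In exactly the same way we obtain for $i \ne j$:

$$P_{ij}(z) = P_{ii}(z) G_{ij}(z). \tag{8.4.14}$$

The first application is to the case $i = j$.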


Theorem 5. For any state $i$ we have $f_{ii}^{*} = 1$ if and only if

$$\sum_{n=0}^{\infty} p_{ii}^{(n)} = \infty; \tag{8.4.15}$$

if $f_{ii}^{*} < 1$, then we have

$$\sum_{n=0}^{\infty} p_{ii}^{(n)} = \frac{1}{1 - f_{ii}^{*}}. \tag{8.4.16}$$


Proof: From (8.4.13) with $i = j$ and solving for $P_{ii}(z)$ we obtain

$$P_{ii}(z) = \frac{1}{1 - F_{ii}(z)}. \tag{8.4.17}$$

If we put $z = 1$ above and observe that

$$P_{ii}(1) = \sum_{n=0}^{\infty} p_{ii}^{(n)}, \qquad F_{ii}(1) = f_{ii}^{*};$$

both assertions of the theorem follow. Let us point out that strictly speaking
we must let $z \uparrow 1$ in (8.4.17) (why?), and use the following theorem from
calculus. If $c_n \ge 0$ and the power series $C(z) = \sum_{n=0}^{\infty} c_n z^n$ converges for $|z| < 1$,
then $\lim_{z \uparrow 1} C(z) = \sum_{n=0}^{\infty} c_n$, finite or infinite. This important result is called an
Abelian theorem (after the great Norwegian mathematician Abel) and will be
used again later.
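
Formula (8.4.16) lends itself to a quick numerical check. The sketch below is an illustration under our own assumptions: the three-state matrix is hypothetical (state 2 is absorbing, so states 0 and 1 turn out to have $f_{ii}^{*} < 1$), the function name `return_sums` is ours, and both series are simply truncated after $N$ terms, which is harmless here because the terms decay geometrically.

```python
def return_sums(P, i, N):
    """Truncated sums of p_ii^(n) over 0 <= n <= N and of f_ii^(n) over 1 <= n <= N."""
    n = len(P)
    row = [1.0 if k == i else 0.0 for k in range(n)]  # p_ik^(0)
    taboo = None  # taboo[k]: reach k at the current time without revisiting i
    sum_p, sum_f = 1.0, 0.0
    for m in range(1, N + 1):
        row = [sum(row[l] * P[l][k] for l in range(n)) for k in range(n)]
        sum_p += row[i]
        if m == 1:
            step = P[i][:]
        else:
            step = [sum(taboo[l] * P[l][k] for l in range(n)) for k in range(n)]
        sum_f += step[i]  # this is f_ii^(m)
        taboo = [step[k] if k != i else 0.0 for k in range(n)]
    return sum_p, sum_f


P = [[0.5, 0.3, 0.2],
     [0.4, 0.4, 0.2],
     [0.0, 0.0, 1.0]]
sum_p, sum_f = return_sums(P, 0, 200)
print(sum_p, sum_f, 1 / (1 - sum_f))
```

For this matrix a direct calculation gives $f_{00}^{*} = 0.5 + 0.3 \cdot (0.4/0.6) = 0.7$, so both the first and the third printed numbers should be close to $10/3$.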
The dichotomy in Theorem 5 yields a fundamental property of a state. 


Definition of recurrent and nonrecurrent state. A state $i$ is called recurrent iff
$f_{ii}^{*} = 1$, and nonrecurrent iff $f_{ii}^{*} < 1$.

The adjectives "persistent" and "transient" are used by some authors for
"recurrent" and "nonrecurrent." For later use let us insert a corollary to
Theorem 5 here.


Corollary to Theorem 5. If $j$ is nonrecurrent, then $\sum_{n=0}^{\infty} p_{ij}^{(n)} < \infty$ for every $i$.
In particular, $\lim_{n \to \infty} p_{ij}^{(n)} = 0$ for every $i$.


Proof: For $i = j$, this is just (8.4.16). If $i \ne j$, this follows from (8.4.13) since

$$P_{ij}(1) = F_{ij}(1) P_{jj}(1) \le P_{jj}(1) < \infty.$$


It is easy to show that two communicating states are either both recurrent
or both nonrecurrent. Thus either property pertains to a class and may be
called a class property. To see this let $i \leftrightsquigarrow j$, then there exist $m \ge 1$ and $m' \ge 1$
such that $p_{ij}^{(m)} > 0$ and $p_{ji}^{(m')} > 0$. Now the same argument for (8.4.1) leads to
the inequality:

$$p_{jj}^{(m'+n+m)} \ge p_{ji}^{(m')} p_{ii}^{(n)} p_{ij}^{(m)}.$$

Summing this over $n \ge 0$, we have

$$\sum_{n=0}^{\infty} p_{jj}^{(n)} \ge \sum_{n=0}^{\infty} p_{jj}^{(m'+n+m)} \ge p_{ji}^{(m')} \Bigl( \sum_{n=0}^{\infty} p_{ii}^{(n)} \Bigr) p_{ij}^{(m)}. \tag{8.4.18}$$

If $i$ is recurrent, then by (8.4.15) the last term above is infinite, hence so is
the first term, and this means $j$ is recurrent by Theorem 5. Since $i$ and $j$ are
interchangeable we have proved our assertion regarding recurrent states. The
assertion regarding nonrecurrent states then follows because nonrecurrence
is just the negation of recurrence.

The preceding result is nice and useful, but we need a companion which
says that it is impossible to go from a recurrent to a nonrecurrent state. [The
reverse passage is possible as shown by Example 3 of §8.3.] This result lies
deeper and will be proved twice below by different methods. The first relies
on the dual Theorems 3 and 4.


Theorem 6. If $i$ is recurrent and $i \rightsquigarrow j$, then $j$ is also recurrent.


Proof: There is nothing to prove if $i = j$, hence we may suppose $i \ne j$. We
have by (8.4.13) and (8.4.14):

$$P_{ij}(z) = F_{ij}(z) P_{jj}(z), \qquad P_{ij}(z) = P_{ii}(z) G_{ij}(z),$$

from which we infer

$$F_{ij}(z) P_{jj}(z) = P_{ii}(z) G_{ij}(z). \tag{8.4.19}$$

If we let $z \uparrow 1$ as at the end of proof of Theorem 5, we obtain:

$$F_{ij}(1) P_{jj}(1) = P_{ii}(1) G_{ij}(1) = \infty$$

since $G_{ij}(1) > 0$ by the Lemma and $P_{ii}(1) = \infty$ by Theorem 5. Since $F_{ij}(1) > 0$
by the Lemma we conclude that $P_{jj}(1) = \infty$, hence $j$ is recurrent by Theorem 5.
This completes the proof of Theorem 6 but let us note that the formula
(8.4.19) written in the form

$$\frac{P_{ii}(z)}{P_{jj}(z)} = \frac{F_{ij}(z)}{G_{ij}(z)}$$

leads to other interesting results when $z \uparrow 1$, called "ratio limit theorems"
(see [Chung 2; §1.9]).


8.5. Further developments 


To probe the depth of the notion of recurrence we now introduce a new
"transfinite" probability, that of entering a given state infinitely often:

$$q_{ij} = P_i\{X_n = j \text{ for an infinite number of values of } n\}. \tag{8.5.1}$$

We have already encountered this notion in Theorem 2 of §8.2; in fact the
latter asserts in our new notation that $q_{ij} = 1$ for every $i$ and $j$ in a symmetric
random walk. Now what exactly does "infinitely often" mean? It means
"again and again, without end," or more precisely: "given any large number,
say $m$, it will happen more than $m$ times." This need not strike you as anything
hard to grasp, but it may surprise you that if we want to express $q_{ij}$ in
symbols, it looks like this (cf. the end of §1.3):

$$q_{ij} = P_i\Bigl\{ \bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} [X_n = j] \Bigr\}.$$

For comparison let us write also

$$f_{ij}^{*} = P_i\Bigl\{ \bigcup_{n=1}^{\infty} [X_n = j] \Bigr\}.$$

However, we will circumvent such formidable formalities in our discussion
below.
To begin with, it is trivial from the meaning of the probabilities that

$$q_{ij} \le f_{ij}^{*} \tag{8.5.2}$$

because "infinitely often" certainly entails "at least once." The next result is
crucial.
Theorem 7. For any state $i$, we have

$$q_{ii} = \begin{cases} 1 & \text{if } i \text{ is recurrent,} \\ 0 & \text{if } i \text{ is nonrecurrent.} \end{cases}$$


Proof: Put $X_0 = i$, and $\alpha = f_{ii}^{*}$. Then $\alpha$ is the probability of at least one
return to $i$. At the moment of the first return, the particle is in $i$ and its prior
history is irrelevant; hence from that moment on it will move as if making
a fresh start from $i$ ("like a newborn baby"). If we denote by $R_m$ the event
of "at least $m$ returns," then this implies that the conditional probability
$P(R_2 \mid R_1)$ is the same as $P(R_1)$ and consequently

$$P(R_2) = P(R_1 R_2) = P(R_1) P(R_2 \mid R_1) = \alpha \cdot \alpha = \alpha^2.$$

Repeating this argument, we have by induction for $m \ge 1$:

$$P(R_{m+1}) = P(R_m R_{m+1}) = P(R_m) P(R_{m+1} \mid R_m) = \alpha^m \cdot \alpha = \alpha^{m+1}.$$

Therefore the probability of infinitely many returns is equal to

$$\lim_{m \to \infty} P(R_m) = \lim_{m \to \infty} \alpha^m = \begin{cases} 1 & \text{if } \alpha = 1, \\ 0 & \text{if } \alpha < 1, \end{cases} \tag{8.5.3}$$

proving the theorem.


Now is a good stopping time to examine the key point in the preceding
proof:

$$P(R_{m+1} \mid R_m) = \alpha,$$

which is explained by considering the moment of the $m$th return to the initial
state $i$ and starting anew from that moment on. The argument works because
whatever has happened prior to the moment is irrelevant to future happenings.
[Otherwise one can easily imagine a situation in which previous returns
tend to inhibit a new one, such as visiting the same old tourist attraction.]
This seems to be justified by the Markov property except for one essential
caveat. Take $m = 1$ for definiteness; then the moment of the first return is
precisely the $T_i$ defined in (8.4.2), and the argument above is based on applying
the Markovian assumption (8.3.3) at the moment $T_i$. But $T_i$ is a random
variable, its value depends on the sample point $\omega$; can we substitute it for the
constant time $n$ in those formulas? You might think that since the latter holds
true for any $n$, and $T_i(\omega)$ is equal to some $n$ whatever $\omega$ may be, such a substitution
must be "OK". (Indeed we have made a similar substitution in §8.2
without justification.) The fallacy in this thinking is easily exposed,† but here
we will describe the type of random variables for which the substitution is
legitimate.

Given the homogeneous Markov chain $\{X_n, n \in N^0\}$, a random variable
$T$ is said to be optional [or a stopping time] iff for each $n$, the event $\{T = n\}$
is determined by $\{X_0, X_1, \ldots, X_n\}$ alone. An event is prior to $T$ iff it is determined
by $\{X_0, X_1, \ldots, X_{T-1}\}$, and posterior to $T$ iff it is determined by
$\{X_{T+1}, X_{T+2}, \ldots\}$. (When $T = 0$ there is no prior event to speak of.) The state
of the particle at the moment $T$ is of course given by $X_T$ [note: this is the
random variable $\omega \to X_{T(\omega)}(\omega)$]. In case $T$ is a constant $n$, these notions agree
with our usual interpretation of "past" and "future" relative to the "present"
moment $n$. In the general case they may depend on the sample point. There
is nothing far-fetched in this; for instance phrases such as "pre-natal care,"
"post-war construction" or "the day after the locusts" contain an uncertain
and therefore random date. When a gambler decides that he will bet on red
"after black has appeared three times in a row," he is dealing with $X_{T+1}$,
where the value of $T$ is a matter of chance. However, it is essential to observe
that these relative notions make sense by virtue of the way an optional $T$ is
defined. Otherwise if the determination of $T$ involves the future as well as
the past and present, then "pre-$T$" and "post-$T$" will be mixed up and serve
no useful purpose. If the gambler can foresee the future, he would not need
probability theory! In this sense an optional time has also been described as
being "independent of the future"; it must have been decided upon as an
"option" without the advantage of clairvoyance.

We can now formulate the following extension of (8.3.3). For any optional
$T$, any event $A$ prior to $T$ and any event $B$ posterior to $T$, we have

$$P\{B \mid X_T = i; A\} = P\{B \mid X_T = i\}; \tag{8.5.4}$$

and in particular for any states $i$ and $j$:

$$P\{X_{T+1} = j \mid X_T = i; A\} = p_{ij}. \tag{8.5.5}$$

† E.g., suppose $X_0 = i_0 \ne k$, and take $T = T_k - 1$ in (8.5.5). Since $X_{T+1} = k$ the
equation cannot hold in general.

This is known as the strong Markov property. It is actually implied by the
apparently weaker form given in (8.3.3), hence also in the original definition
(8.3.1). Probabilists used to announce the weak form and use the strong one
without mentioning the difference. Having flushed the latter out in the open
we will accept it as the definition for a homogeneous Markov chain. For a
formal proof see [Chung 2; §1.13]. Let us observe that the strong Markov
property was needed as early as in the proof of Theorem 2 of §8.2, where it
was deliberately concealed in order not to sound a premature alarm. Now it
is time to look back with understanding.

To return to Theorem 7 we must now verify that the $T_i$ used in the proof
there is indeed optional. This has been effectively shown in (8.4.6), for the
event

$$\{T_i = n\} = \{X_v \ne i \text{ for } 1 \le v \le n - 1;\ X_n = i\}$$

is clearly determined by $\{X_1, \ldots, X_n\}$ only. This completes the rigorous
proof of Theorem 7, to which we add a corollary.

Corollary to Theorem 7. For any $i$ and $j$,

$$q_{ij} = \begin{cases} f_{ij}^{*} & \text{if } j \text{ is recurrent,} \\ 0 & \text{if } j \text{ is nonrecurrent.} \end{cases}$$


Proof: This follows at once from the theorem and the relation:

$$q_{ij} = f_{ij}^{*} q_{jj}. \tag{8.5.6}$$

For, to enter $j$ infinitely many times means to enter it at least once and then
return to it infinitely many times. As in the proof of Theorem 7, the reasoning
involved here is based on the strong Markov property.

The next result shows the power of "thinking infinite."


Theorem 8. If $i$ is recurrent and $i \rightsquigarrow j$, then

$$q_{ij} = q_{ji} = 1. \tag{8.5.7}$$


Proof: The conclusion implies $i \leftrightsquigarrow j$, and that $j$ is recurrent by the corollary
above. Thus the following proof contains a new proof of Theorem 6.

Let us note that for any two events $A$ and $B$, we have $A \subset AB \cup B^c$ and
consequently

$$P(A) \le P(B^c) + P(AB). \tag{8.5.8}$$

Now consider

$$A = \{\text{enter } i \text{ infinitely often}\}, \qquad B = \{\text{enter } j \text{ at least once}\}.$$

Then $P_i(A) = q_{ii} = 1$ by Theorem 7 and $P_i(B^c) = 1 - f_{ij}^{*}$. As for $P_i(AB)$ this
means the probability that the particle will enter $j$ at some finite time and
thereafter enter $i$ infinitely many times, because "infinite minus finite is still
infinite." Hence if we apply the strong Markov property at the first entrance
time into $j$, we have

$$P_i(AB) = f_{ij}^{*} q_{ji}.$$

Substituting into the inequality (8.5.8), we obtain

$$1 = q_{ii} \le 1 - f_{ij}^{*} + f_{ij}^{*} q_{ji}$$

and so

$$f_{ij}^{*} \le f_{ij}^{*} q_{ji}.$$

Since $f_{ij}^{*} > 0$ this implies $1 \le q_{ji}$, hence $q_{ji} = 1$. Since $q_{ji} \le f_{ji}^{*}$ it follows that
$f_{ji}^{*} = 1$, and so $j \rightsquigarrow i$. Thus $i$ and $j$ communicate and therefore $j$ is recurrent
by (8.4.18). Knowing this we may interchange the roles of $i$ and $j$ in the
preceding argument to infer $q_{ij} = 1$.


Corollary. In a recurrent class (8.5.7) holds for any two states $i$ and $j$.


When the state space of a chain forms a single recurrent class, we shall
call the chain recurrent; similarly for "nonrecurrent". The state of affairs for
a recurrent chain described in the preceding Corollary is precisely that for a
symmetric random walk in Theorem 2 of §8.2. In fact, the latter is a particular
case as we now proceed to show.

We shall apply the general methods developed above to the case of random
walk discussed in §8.1, namely Example 1 of §8.3. We begin by evaluating
$p_{ii}^{(n)}$. This is the probability that the particle returns to its initial position $i$
in exactly $n$ steps. Hence $p_{ii}^{(2n-1)} = 0$ for $n \ge 1$; and in the notation of (8.1.2)

$$p_{ii}^{(2n)} = P\{\xi_1 + \cdots + \xi_{2n} = 0\} = \binom{2n}{n} p^n q^n \tag{8.5.9}$$

by Bernoulli's formula (7.3.1), since there must be $n$ steps to the right and $n$
steps to the left, in some order. Thus we obtain the generating function

$$P_{ii}(z) = \sum_{n=0}^{\infty} \binom{2n}{n} (pqz^2)^n. \tag{8.5.10}$$

Recalling the general binomial coefficients from (5.4.4), we record the pretty
identity:

$$\binom{-\frac{1}{2}}{n} = \frac{(-1)^n\, 1 \cdot 3 \cdots (2n-1)}{2^n \cdot n!} = \frac{(-1)^n}{2^{2n}} \binom{2n}{n}, \tag{8.5.11}$$

where the second equation is obtained by multiplying both the denominator
and numerator of its left member by $2^n \cdot n! = 2 \cdot 4 \cdots (2n)$. Substituting into
(8.5.10), we arrive at the explicit analytic formula:

$$P_{ii}(z) = \sum_{n=0}^{\infty} \binom{-\frac{1}{2}}{n} (-4pqz^2)^n = (1 - 4pqz^2)^{-1/2}, \tag{8.5.12}$$

where the second member is the binomial (Taylor's) series of the third
member.
It follows that

$$\sum_{n=0}^{\infty} p_{ii}^{(n)} = P_{ii}(1) = \lim_{z \uparrow 1} P_{ii}(z) = \lim_{z \uparrow 1} (1 - 4pqz^2)^{-1/2}. \tag{8.5.13}$$

Now $4pq = 4p(1 - p) \le 1$ for $0 \le p \le 1$; and $= 1$ if and only if $p = 1/2$
(why?). Hence the series above diverges if $p = 1/2$, and converges if $p \ne 1/2$.
By Theorem 5, $i$ is recurrent if and only if $p = 1/2$. The calculations above do
not depend on the integer $i$ because of spatial homogeneity. Thus for $p = 1/2$
the chain is recurrent; otherwise it is nonrecurrent. In other words, the random
walk is recurrent if and only if it is symmetric.
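
The identity (8.5.11) and the dichotomy in (8.5.13) can be confirmed on a computer. The following Python sketch is only an illustration: the names `general_binomial` and `return_series` are hypothetical, the general binomial coefficient is taken to be $a(a-1)\cdots(a-n+1)/n!$, and the series $\sum_n \binom{2n}{n}(pq)^n$ is truncated after $N$ terms.

```python
from fractions import Fraction
from math import comb


def general_binomial(a, n):
    """General binomial coefficient a(a-1)...(a-n+1)/n! for a rational a."""
    result = Fraction(1)
    for k in range(n):
        result *= (a - k) / Fraction(k + 1)
    return result


def return_series(p, N):
    """Partial sum of (8.5.10) at z = 1: the terms C(2n, n) (pq)^n for 0 <= n <= N."""
    x = p * (1 - p)
    term, total = 1.0, 1.0
    for n in range(N):
        # ratio of consecutive coefficients: C(2n+2, n+1) / C(2n, n) = 2(2n+1)/(n+1)
        term *= 2 * (2 * n + 1) / (n + 1) * x
        total += term
    return total


# Identity (8.5.11), checked exactly for small n.
print(all(general_binomial(Fraction(-1, 2), n)
          == Fraction((-1) ** n * comb(2 * n, n), 4 ** n) for n in range(12)))
# Nonsymmetric walk: the sum is finite and close to (1 - 4pq)^(-1/2).
print(return_series(0.7, 400), (1 - 4 * 0.7 * 0.3) ** -0.5)
# Symmetric walk: the partial sums keep growing with N.
print(return_series(0.5, 100), return_series(0.5, 10000))
```

For $p = 0.7$ the limit is $(1 - 0.84)^{-1/2} = 2.5$, whereas for $p = 1/2$ the partial sums grow without bound, in agreement with the divergence asserted above.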

There is another method of showing this directly from (8.5.9), without the
use of generating functions. For when $p = \frac{1}{2}$ we have

$$p_{ii}^{(2n)} = \binom{2n}{n} \frac{1}{2^{2n}} \sim \frac{1}{\sqrt{\pi n}} \tag{8.5.14}$$

by (7.3.6) as an application of Stirling's formula. Hence by the comparison
test for positive series, the series in (8.5.13) diverges because $\sum_n \frac{1}{\sqrt{n}}$ does. This
method has the merit of being applicable to random walks in higher dimensions.
Consider the symmetric random walk in $R^2$ (Example 10 of §8.3 with
all four probabilities equal to 1/4). To return from any state $(i, j)$ to $(i, j)$ in $2n$
steps means that: for some $k$, $0 \le k \le n$, the particle takes, in some order,
$k$ steps each to the east and west, and $n - k$ steps each to the north and south.
The probability for this, by the multinomial formula (6.4.6), is equal to:


$$\begin{aligned}
p_{(i,j)(i,j)}^{(2n)} &= \frac{1}{4^{2n}} \sum_{k=0}^{n} \frac{(2n)!}{k!\,k!\,(n-k)!\,(n-k)!} \\
&= \frac{(2n)!}{4^{2n}\, n!\, n!} \sum_{k=0}^{n} \binom{n}{k}^2 = \frac{(2n)!}{4^{2n}\, n!\, n!} \binom{2n}{n} = \left[ \frac{1}{2^{2n}} \binom{2n}{n} \right]^2,
\end{aligned} \tag{8.5.15}$$

where in the penultimate equation we have used a formula given in Exercise
28 of Chapter 3. The fact that this probability turns out to be the exact square
of the one in (8.5.14) is a pleasant coincidence. [It is not due to any apparent
independence between the two components of the walk along the two coordinate
axes.] It follows by comparison with (8.5.14) that

$$\sum_{n} p_{(i,j)(i,j)}^{(2n)} \sim \sum_{n} \frac{1}{\pi n} = \infty.$$

Hence another application of Theorem 5 shows that the symmetric random
walk in the plane as well as on the line is a recurrent Markov chain. A similar
but more complicated argument shows that it is nonrecurrent in $R^d$ for $d \ge 3$,
because the probability analogous to that in (8.5.15) is bounded by $\frac{c}{n^{d/2}}$ (where
$c$ is a constant), and $\sum_n \frac{1}{n^{d/2}}$ converges for $d \ge 3$. These results were first
discovered by Pólya in 1921. The non-symmetric case can be treated by using
the normal approximation given in (7.3.13), but there the nonrecurrence is
already implied by the strong law of large numbers as in $R^1$; see §8.2.
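
The "pleasant coincidence" that the planar return probability is the exact square of the one on the line can be verified by exact arithmetic. The sketch below assumes the multinomial sum has the form $\frac{1}{4^{2n}} \sum_{k=0}^{n} \frac{(2n)!}{(k!)^2 ((n-k)!)^2}$ and that the probability on the line is $\binom{2n}{n} / 2^{2n}$; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def planar_return(n):
    """Return probability in 2n steps for the symmetric walk in the plane."""
    total = sum(Fraction(factorial(2 * n), (factorial(k) * factorial(n - k)) ** 2)
                for k in range(n + 1))
    return total / 4 ** (2 * n)


def line_return(n):
    """Return probability in 2n steps for the symmetric walk on the line."""
    return Fraction(comb(2 * n, n), 2 ** (2 * n))


print(all(planar_return(n) == line_return(n) ** 2 for n in range(12)))
print(planar_return(1), line_return(1))
```

For $n = 1$ the two values are 1/4 and 1/2, as a direct count of the four possible two-step returns in the plane confirms.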

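As another illustration, we will derive an explicit formula for $f_{ii}^{(n)}$ in case
$p = \frac{1}{2}$. By (8.4.17) and (8.5.12), we have

$$F_{ii}(z) = 1 - \frac{1}{P_{ii}(z)} = 1 - (1 - z^2)^{1/2}.$$

Hence another expansion by means of binomial series gives

$$F_{ii}(z) = 1 - \sum_{n=0}^{\infty} \binom{\frac{1}{2}}{n} (-z^2)^n = \frac{1}{2} z^2 + \sum_{n=2}^{\infty} \frac{1 \cdot 3 \cdots (2n-3)}{2^n \cdot n!} z^{2n}.$$

Thus $f_{ii}^{(2n-1)} = 0$; and

$$f_{ii}^{(2n)} = \frac{1}{2^{2n}} \binom{2n}{n} \frac{1}{2n-1}, \quad n \ge 1;$$

by a calculation similar to (8.5.11). In particular we have

$$f_{ii}^{(2)} = \frac{1}{2}, \quad f_{ii}^{(4)} = \frac{1}{8}, \quad f_{ii}^{(6)} = \frac{1}{16}, \quad f_{ii}^{(8)} = \frac{5}{128}, \quad f_{ii}^{(10)} = \frac{7}{256}.$$

Comparison with (8.5.14) shows that

$$f_{ii}^{(2n)} \sim \frac{1}{2\sqrt{\pi}\, n^{3/2}}$$

and so $\sum_{n=1}^{\infty} n f_{ii}^{(n)} = \infty$. This can also be gotten by calculating $F_{ii}'(1)$. Thus,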


although return is almost certain the expected time before return is infinite.
This result will be seen in a moment to be equivalent to the remark made
in §8.2 that $e_1 = \infty$.
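
As a check on the formula for the first return probabilities of the symmetric walk, one may tabulate a few values and test them against Theorem 3 with $i = j$. The sketch below is a hedged illustration: it assumes $f_{ii}^{(2n)} = \binom{2n}{n} / \bigl(2^{2n}(2n-1)\bigr)$ and $p_{ii}^{(2n)} = \binom{2n}{n} / 2^{2n}$, and the function names are our own.

```python
from fractions import Fraction
from math import comb


def p_return(n):
    """p_ii^(2n) for the symmetric random walk on the line."""
    return Fraction(comb(2 * n, n), 4 ** n)


def f_return(n):
    """f_ii^(2n), the probability that the first return occurs at time 2n."""
    return Fraction(comb(2 * n, n), 4 ** n * (2 * n - 1))


def check_first_entrance(N):
    """Check (8.4.8) with i = j: p^(2n) is the sum of f^(2v) p^(2n-2v) over 1 <= v <= n."""
    return all(p_return(n) == sum(f_return(v) * p_return(n - v)
                                  for v in range(1, n + 1))
               for n in range(1, N + 1))


print([str(f_return(n)) for n in range(1, 6)])  # ['1/2', '1/8', '1/16', '5/128', '7/256']
print(check_first_entrance(30))
```

Only even times enter the sum because both sequences vanish at odd times.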

We can calculate $f_{ij}^{(n)}$ for any $i$ and $j$ in a random walk by a similar method.
However, sometimes a combinatorial argument is quicker and more revealing.
For instance, we have

$$f_{ii}^{(2n)} = \frac{1}{2} f_{i+1,i}^{(2n-1)} + \frac{1}{2} f_{i-1,i}^{(2n-1)} = f_{10}^{(2n-1)} = f_{01}^{(2n-1)}. \tag{8.5.17}$$

To argue this let the particle start from $i$ and consider the outcome of its
first step as in the derivation of (8.1.4); then use the symmetry and spatial
homogeneity to get the rest. The details are left to the reader.


8.6. Steady state 


In this section we consider a recurrent Markov chain, namely we suppose that 
the state space forms a single recurrent class. 

After the particle in such a chain has been in motion for a long time, it will
be found in various states with various probabilities. Do these settle down to
limiting values? This is what the physicists and engineers call a "steady state"
(distribution).† They are accustomed to thinking in terms of an "ensemble" or
large number of particles moving according to the same probability laws and 
independently of one another, such as in the study of gaseous molecules. In 
the present case the laws are those pertaining to a homogeneous Markov 
chain as discussed in the preceding sections. After a long time, the proportion 
(percentage) of particles to be found in each state gives approximately the 
steady-state probability of that state. [Note the double usage of the word 
“state” in the last sentence; we shall use “stationary” for the adjective 
“steady-state’’]. In effect, this is the frequency interpretation of probability 
mentioned in Example 3 of §2.1, in which the limiting proportions are taken 
to determine the corresponding probabilities. In our language, if the particle
starts from the state $i$, then the probability of the set of paths in which it
moves to state $j$ at time $n$, namely $\{\omega \mid X_n(\omega) = j\}$, is given by
$P_i\{X_n = j\} = p_{ij}^{(n)}$. We are therefore interested in the asymptotic behavior
of $p_{ij}^{(n)}$ as $n \to \infty$. It turns out that a somewhat more amenable quantity is
its average value over a long period of time, namely:


(8.6.1)    $\frac{1}{n+1} \sum_{v=0}^{n} p_{ij}^{(v)}$   or   $\frac{1}{n} \sum_{v=1}^{n} p_{ij}^{(v)}.$


† Strange to relate, they call a "distribution" a "state"!

The difference between these two averages is negligible for large $n$ but we shall
use the former. This quantity has a convenient interpretation as follows. Fix
our attention on a particular state $j$ and imagine that a counting device records
the number of time units the particle spends in $j$. This is done by introducing
the random variables below which count 1 for the state $j$ but 0 for any
other state:

    $\xi_v(j) = \begin{cases} 1 & \text{if } X_v = j, \\ 0 & \text{if } X_v \ne j. \end{cases}$


We have used such indicators e.g. in (6.4.11). Next we put

    $N_n(j) = \sum_{v=0}^{n} \xi_v(j)$


which represents the total occupation time of the state $j$ in $n$ steps. Now if $E_i$
denotes the mathematical expectation associated with the chain starting from $i$
[this is a conditional expectation; see end of §5.2], we have


    $E_i(\xi_v(j)) = p_{ij}^{(v)}$

and so by Theorem 1 of §6.1:


(8.6.2)    $E_i(N_n(j)) = \sum_{v=0}^{n} E_i(\xi_v(j)) = \sum_{v=0}^{n} p_{ij}^{(v)}.$


Thus the quantity in (8.6.1) turns out to be the average expected occupation 
time. 

In order to study this we consider first the case $i = j$ and introduce the
expected return time from $j$ to $j$ as follows:


(8.6.3)    $m_{jj} = E_j(T_j) = \sum_{v=1}^{\infty} v f_{jj}^{(v)},$

where $T_j$ is defined in (8.4.2). Since $j$ is a recurrent state we know that $T_j$ is
almost surely finite, but its expectation may be finite or infinite. We shall see
that the distinction between these two cases is essential.

Here is the heuristic argument linking (8.6.2) and (8.6.3). Since the time
required for a return is $m_{jj}$ units on the basis of expectation, there should be
about $n/m_{jj}$ such returns in a span of $n$ time units on the same basis. In other
words the particle spends about $n/m_{jj}$ units of time in the state $j$ during the
first $n$ steps, namely $E_j(N_n(j)) \approx n/m_{jj}$. The same argument shows that it
makes no difference whether the particle starts from $j$ or any other state $i$,
because after the first entrance into $j$ the initial $i$ may be forgotten and we are
concerned only with the subsequent returns from $j$ to $j$. Thus we are led to the
following limit theorem.


Theorem 9. For any $i$ and $j$ we have


(8.6.4)    $\lim_{n \to \infty} \frac{1}{n+1} \sum_{v=0}^{n} p_{ij}^{(v)} = \frac{1}{m_{jj}}.$


The argument indicated above can be made rigorous by invoking a general
form of the strong law of large numbers (see §7.5), applied to the successive 
return times which form a sequence of independent and identically distributed 
random variables. Unfortunately the technical details are above the level of 
this book. There is another approach which relies on a powerful analytical 
result due to Hardy and Littlewood. [This is the same Hardy as in the Hardy- 
Weinberg theorem of §5.6.] It is known as a Tauberian theorem (after Tauber 
who first found a result of the kind) and may be stated as follows. 


Theorem 10. If $A(z) = \sum_{n=0}^{\infty} a_n z^n$, where $a_n \ge 0$ for all $n$ and the series
converges for $0 \le z < 1$, then we have


(8.6.5)    $\lim_{n \to \infty} \frac{1}{n+1} \sum_{v=0}^{n} a_v = \lim_{z \to 1} (1 - z) A(z).$


To get a feeling for this theorem, suppose all $a_n = c > 0$. Then


    $A(z) = c \sum_{n=0}^{\infty} z^n = \frac{c}{1 - z}$


and the relation in (8.6.5) reduces to the trivial identity


    $\frac{1}{n+1} \sum_{v=0}^{n} c = c = (1 - z) \frac{c}{1 - z}.$


Now take $A(z)$ to be


    $P_{ij}(z) = \sum_{n=0}^{\infty} p_{ij}^{(n)} z^n = F_{ij}(z) P_{jj}(z) = \frac{F_{ij}(z)}{1 - F_{jj}(z)},$


where the last two equations come from (8.4.13) and (8.4.17). Then we have


    $\lim_{z \to 1} (1 - z) P_{ij}(z) = F_{ij}(1) \lim_{z \to 1} \frac{1 - z}{1 - F_{jj}(z)} = \lim_{z \to 1} \frac{1 - z}{1 - F_{jj}(z)},$


since $F_{ij}(1) = f_{ij}^* = 1$ by Theorem 8 of §8.5. The last-written limit may be
evaluated by l'Hospital's rule:


    $\lim_{z \to 1} \frac{(1 - z)'}{(1 - F_{jj}(z))'} = \lim_{z \to 1} \frac{-1}{-F_{jj}'(z)} = \frac{1}{F_{jj}'(1)},$


where "$'$" stands for differentiation with respect to $z$. Since
$F_{jj}'(z) = \sum_{v=1}^{\infty} v f_{jj}^{(v)} z^{v-1}$ we have $F_{jj}'(1) = \sum_{v=1}^{\infty} v f_{jj}^{(v)} = m_{jj}$,
and so (8.6.4) is a special case of (8.6.5).
We now consider a finite state space $I$ in order not to strain our mathematical
equipment. The finiteness of $I$ has an immediate consequence.


Theorem 11. If $I$ is finite and forms a single class (namely if there are only a
finite number of states and they all communicate with each other), then the
chain is necessarily recurrent.


Proof: Suppose the contrary; then each state is transient and so almost surely
the particle can spend only a finite number of time units in it by Theorem 7.
Since the number of states is finite, the particle can spend altogether only a
finite number of time units in the whole space $I$. But time increases ad infinitum
and the particle has nowhere else to go. This absurdity proves that the
chain must be recurrent. (What then can the particle do?)


Let $I = \{1, 2, \ldots, l\}$, and put

    $x = (x_1, \ldots, x_l)$

which is a row vector of $l$ components. Consider the "steady-state equation"

(8.6.6)    $x = x\Pi$   or   $x(\Delta - \Pi) = 0$

where $\Delta$ is the identity matrix with $l$ rows and $l$ columns: $\Delta = (\delta_{ij})$, and $\Pi$ is
the transition matrix in (8.3.9). This is a system of $l$ linear homogeneous
equations in $l$ unknowns. Now the determinant of the matrix $\Delta - \Pi$


    $\begin{vmatrix} 1 - p_{11} & -p_{12} & \cdots & -p_{1l} \\ -p_{21} & 1 - p_{22} & \cdots & -p_{2l} \\ \vdots & \vdots & & \vdots \\ -p_{l1} & -p_{l2} & \cdots & 1 - p_{ll} \end{vmatrix}$


is equal to zero because the sum of all elements in each row is $1 - \sum_j p_{ij} = 0$.
Hence we know from linear algebra that the system has a non-trivial solution,
namely one which is not the zero vector. Clearly if $x$ is a solution then
so is $cx = (cx_1, cx_2, \ldots, cx_l)$ for any constant $c$. The following theorem
identifies all solutions when $I$ is a single finite class.


We shall write


(8.6.7)    $w_j = \frac{1}{m_{jj}}, \quad j \in I; \qquad w = (w_1, \ldots, w_l),$


and $\sum_j$ for $\sum_{j \in I}$ below.


Theorem 12. If $I$ is finite and forms a single class, then
(i) $w$ is a solution of (8.6.6);
(ii) $\sum_j w_j = 1$;
(iii) $w_j > 0$ for all $j$;
(iv) any solution of (8.6.6) is a constant multiple of $w$.
Proof: We have from (8.3.7), for every $v \ge 0$:


    $p_{ik}^{(v+1)} = \sum_j p_{ij}^{(v)} p_{jk}.$


Taking an average over $v$ we get


    $\frac{1}{n+1} \sum_{v=0}^{n} p_{ik}^{(v+1)} = \sum_j \left( \frac{1}{n+1} \sum_{v=0}^{n} p_{ij}^{(v)} \right) p_{jk}.$


The left member differs from $\frac{1}{n+1} \sum_{v=0}^{n} p_{ik}^{(v)}$ by $\frac{1}{n+1} \left( p_{ik}^{(n+1)} - p_{ik}^{(0)} \right)$, which
tends to zero as $n \to \infty$; hence its limit is equal to $w_k$ by Theorem 9. Since $I$
is finite we may let $n \to \infty$ term by term in the right member. This yields

    $w_k = \sum_j w_j p_{jk}$

which is $w = w\Pi$; hence (i) is proved. We can now iterate:

(8.6.8)    $w = w\Pi = (w\Pi)\Pi = w\Pi^2 = (w\Pi)\Pi^2 = w\Pi^3 = \cdots$

to obtain $w = w\Pi^n$, or explicitly for $n \ge 1$:

(8.6.9)    $w_k = \sum_j w_j p_{jk}^{(n)}.$


Next we have $\sum_j p_{ij}^{(v)} = 1$ for every $i$ and $v \ge 0$. Taking an average over $v$ we
obtain


    $\frac{1}{n+1} \sum_{v=0}^{n} \sum_j p_{ij}^{(v)} = 1.$


It follows that


    $\sum_j w_j = \sum_j \lim_{n \to \infty} \left( \frac{1}{n+1} \sum_{v=0}^{n} p_{ij}^{(v)} \right) = \lim_{n \to \infty} \frac{1}{n+1} \sum_{v=0}^{n} \sum_j p_{ij}^{(v)} = 1,$


where the second equation holds because $I$ is finite. This establishes (ii) from
which we deduce that at least one of the $w_j$'s, say $w_i$, is positive. For any $k$
we have $i \rightsquigarrow k$ and so there exists $n$ such that $p_{ik}^{(n)} > 0$. Using this value of $n$
in (8.6.9), we see that $w_k$ is also positive. Hence (iii) is true. Finally suppose $x$
is any solution of (8.6.6). Then $x = x\Pi^v$ for every $v \ge 1$ by iteration as before,
and


    $x = \frac{1}{n+1} \sum_{v=0}^{n} x\Pi^v$


by averaging. In explicit notation this is


    $x_k = \sum_j x_j \left( \frac{1}{n+1} \sum_{v=0}^{n} p_{jk}^{(v)} \right).$


Letting $n \to \infty$ and using Theorem 9 we obtain

    $x_k = \left( \sum_j x_j \right) w_k.$


Hence (iv) is true with $c = \sum_j x_j$. Theorem 12 is completely proved.

We call $\{w_j, j \in I\}$ the stationary (steady-state) distribution of the Markov
chain. It is indeed a probability distribution by (ii). The next result explains
the meaning of the word "stationary."


Theorem 13. Suppose that we have for every $j$,

(8.6.10)    $P\{X_0 = j\} = w_j,$


then the same is true when $X_0$ is replaced by any $X_n$, $n \ge 1$. Furthermore the
joint probability


(8.6.11)    $P\{X_{n+v} = j_v, \; 0 \le v \le l\}$


for arbitrary $j_v$, is the same for all $n \ge 0$.

Proof: We have by (8.6.9),


    $P\{X_n = j\} = \sum_i P\{X_0 = i\} P_i\{X_n = j\} = \sum_i w_i p_{ij}^{(n)} = w_j.$

Similarly the probability in (8.6.11) is equal to


    $P\{X_n = j_0\}\, p_{j_0 j_1} \cdots p_{j_{l-1} j_l} = w_{j_0}\, p_{j_0 j_1} \cdots p_{j_{l-1} j_l},$


which is the same for all $n$.

Thus, with the stationary distribution as its initial distribution, the chain 
becomes a stationary process as defined in §5.4. Intuitively, this means that 
if a system is in its steady state, it will hold steady indefinitely there, that is, 
so far as distributions are concerned. Of course changes go on in the system, 
but they tend to balance out to maintain an over-all equilibrium. For instance,
many ecological systems have gone through millions of years of transitions
and may be considered to have reached their stationary phase, until
human intervention abruptly altered the course of evolution. However, if the 
new process is again a homogeneous Markov chain as supposed here, then 
it too will settle down to a steady state in due time according to our theorems. 

The practical significance of Theorem 12 is that it guarantees a solution
of (8.6.6) which satisfies the conditions (ii) and (iii). In order to obtain this
solution, we may proceed as follows. Discard one of the $l$ equations and solve
the remaining equations for $w_2, \ldots, w_l$ in terms of $w_1$. These are of the form
$w_j = c_j w_1$, $1 \le j \le l$, where $c_1 = 1$. The desired solution is then given by


    $w_j = \frac{c_j}{\sum_{k=1}^{l} c_k}, \quad 1 \le j \le l.$
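
The procedure just described can be sketched in a few lines of code. This is only an illustration under stated assumptions: the helper names `solve` and `stationary` and the three-state matrix `P` are hypothetical, and instead of solving in terms of $w_1$ and normalizing afterwards, the sketch replaces the discarded equation by the normalization $\sum_j x_j = 1$, which leads to the same $w$. Exact fractions are used so that the answer can be compared without rounding.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions; A must be nonsingular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def stationary(P):
    """Solve x = x P together with sum(x) = 1 for a finite single-class chain."""
    l = len(P)
    # k-th equation of x (Delta - Pi) = 0:  sum_j x_j (delta_jk - p_jk) = 0
    A = [[(1 if j == k else 0) - P[j][k] for j in range(l)] for k in range(l)]
    b = [0] * l
    A[0] = [1] * l      # discard the first equation, use the normalization instead
    b[0] = 1
    return solve(A, b)

# Hypothetical three-state example.
P = [[Fraction(1, 2), Fraction(1, 2), Fraction(0)],
     [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
     [Fraction(0),    Fraction(1, 2), Fraction(1, 2)]]
w = stationary(P)
print([str(v) for v in w])
```

By (8.6.7), the mean recurrence times of this example chain are then $m_{jj} = 1/w_j$.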


Example 13. A switch may be on or off; call these two positions states 1 and
2. After each unit of time the state may hold or change, but the respective
probabilities depend only on the present position. Thus we have a homogeneous
Markov chain with $I = \{1, 2\}$ and


    $\Pi = \begin{pmatrix} p_{11} & p_{12} \\ p_{21} & p_{22} \end{pmatrix}$

where all elements are supposed to be positive. The steady-state equations are


    $(1 - p_{11}) x_1 - p_{21} x_2 = 0,$
    $-p_{12} x_1 + (1 - p_{22}) x_2 = 0.$


Clearly the second equation is just the negative of the first and may be discarded.
Solving the first equation we get


    $x_2 = \frac{1 - p_{11}}{p_{21}}\, x_1 = \frac{p_{12}}{p_{21}}\, x_1.$


Thus


    $w_1 = \frac{p_{21}}{p_{12} + p_{21}}, \quad w_2 = \frac{p_{12}}{p_{12} + p_{21}}.$


In view of Theorem 9, this means: in the long run the switch will be on or
off for a total amount of time in the ratio of $p_{21} : p_{12}$.


Example 14. At a carnival Daniel won a prize for free rides on the merry-go- 
round. He therefore took “infinitely many” rides but each time when the bell 
rings he moves onto the next hobby-horse forward or backward, with proba- 
bility p or g = 1 — p. What proportion of time was he on each of these 
horses? 




Figure 34 


This may be described as "random walk on a circle." The transition matrix
looks like this:


    $\begin{pmatrix}
    0 & p & 0 & 0 & \cdots & 0 & 0 & q \\
    q & 0 & p & 0 & \cdots & 0 & 0 & 0 \\
    0 & q & 0 & p & \cdots & 0 & 0 & 0 \\
    \vdots & & & & & & & \vdots \\
    0 & 0 & 0 & 0 & \cdots & q & 0 & p \\
    p & 0 & 0 & 0 & \cdots & 0 & q & 0
    \end{pmatrix}$


The essential feature of this matrix is that the elements in each column (as
well as in each row) add up to one. In general notation, this means that we
have for every $j \in I$:


(8.6.12)    $\sum_{i \in I} p_{ij} = 1.$


Such a matrix is called doubly stochastic. Now it is trivial that under the
condition (8.6.12), $x = (1, 1, \ldots, 1)$ where all components are equal to one, is
a solution of the equation (8.6.6). Since the stationary distribution $w$ must
be a multiple of this by (iv) of Theorem 12, and also satisfy (ii), it must be


    $w = \left( \frac{1}{l}, \frac{1}{l}, \ldots, \frac{1}{l} \right)$


where as before $l$ is the number of states in $I$. This means if Daniel spent 4
hours on the merry-go-round and there are 12 horses, his occupation time of
each horse is about 20 minutes, provided that he changed horses sufficiently
many times to make the limiting relation in (8.6.4) operative.
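
A short sketch can confirm the doubly stochastic property and the uniform stationary distribution for Daniel's merry-go-round. The function names `circle_walk` and `times` and the value $p = 2/3$ are assumptions made only for this illustration.

```python
from fractions import Fraction

def circle_walk(l, p):
    """Transition matrix of the random walk on a circle with l states."""
    q = 1 - p
    P = [[Fraction(0)] * l for _ in range(l)]
    for i in range(l):
        P[i][(i + 1) % l] += p      # one horse forward
        P[i][(i - 1) % l] += q      # one horse backward
    return P

def times(x, P):
    """Row vector x times matrix P."""
    l = len(P)
    return [sum(x[i] * P[i][j] for i in range(l)) for j in range(l)]

l = 12
P = circle_walk(l, Fraction(2, 3))
column_sums = [sum(P[i][j] for i in range(l)) for j in range(l)]
w = [Fraction(1, l)] * l
print(column_sums == [1] * l, times(w, P) == w)
print(240 * w[0])                   # minutes per horse out of 4 hours
```

Note that the answer does not depend on $p$: any doubly stochastic matrix on a single finite class has the uniform distribution as its stationary distribution.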

For a recurrent Markov chain in an infinite state space, Theorem 12 must
be replaced by a drastic dichotomy as follows:


(a) either all $w_j > 0$, then (ii) and (iii) hold as before, and Theorem 13 is
    also true;
(b) or all $w_j = 0$.


The chain is said to be positive-recurrent (or strongly ergodic) in case (a), and
null-recurrent (or weakly ergodic) in case (b). The symmetric random walk
discussed in §8.1-2 is an example of the latter (see Exercise 38 below). It can
be shown (see [Chung 2; §I.7]) that if the equation (8.6.6) has a solution
$x = (x_1, x_2, \ldots)$ satisfying the condition $0 < \sum_j |x_j| < \infty$, then in fact all
$x_j > 0$ and the stationary distribution is given by


    $w_j = \frac{x_j}{\sum_k x_k}, \quad j \in I.$


The following example illustrates this result.


Example 15. Let $I = \{0, 1, 2, \ldots\}$, and $p_{ij} = 0$ for $|i - j| > 1$, whereas the
other $p_{ij}$'s are arbitrary positive numbers. These must then satisfy the
equation


(8.6.13)    $p_{j,j-1} + p_{jj} + p_{j,j+1} = 1$


for every $j$. This may be regarded as a special case of Example 7 in §8.3 with
$p_{0,-1} = 0$ and a consequent reduction of state space. It may be called a simple
birth-and-death process (in discrete time) in which $j$ is the population size and
$j \to j + 1$ or $j \to j - 1$ corresponds to a single birth or death. The equation
(8.6.6) becomes:


(8.6.14)    $x_0 = x_0 p_{00} + x_1 p_{10},$
            $x_j = x_{j-1} p_{j-1,j} + x_j p_{jj} + x_{j+1} p_{j+1,j}, \quad j \ge 1.$


This is an infinite system of linear homogeneous equations, but it is clear that
all possible solutions can be obtained by assigning an arbitrary value to $x_0$
and then solving for $x_1, x_2, \ldots$ successively from the equations. Thus we get

    $x_1 = \frac{p_{01}}{p_{10}}\, x_0,$


    $x_2 = \frac{1}{p_{21}} \{ x_1 (1 - p_{11}) - x_0 p_{01} \} = \frac{p_{01} (1 - p_{11} - p_{10})}{p_{21} p_{10}}\, x_0 = \frac{p_{01} p_{12}}{p_{10} p_{21}}\, x_0.$


It is easy to guess (perhaps after a couple more steps) that we have in general


(8.6.15)    $x_j = c_j x_0$  where  $c_0 = 1, \quad c_j = \frac{p_{01} p_{12} \cdots p_{j-1,j}}{p_{10} p_{21} \cdots p_{j,j-1}}, \quad j \ge 1.$


To verify this by induction, let us assume that $p_{j,j-1} x_j = p_{j-1,j} x_{j-1}$; then we
have by (8.6.14) and (8.6.13):


    $p_{j+1,j} x_{j+1} = (1 - p_{jj}) x_j - p_{j-1,j} x_{j-1}$
    $\qquad = (1 - p_{jj} - p_{j,j-1}) x_j = p_{j,j+1} x_j.$


Hence this relation holds for all $j$ and (8.6.15) follows. We have therefore

(8.6.16)    $\sum_{j=0}^{\infty} x_j = \left( \sum_{j=0}^{\infty} c_j \right) x_0,$


and the dichotomy cited above is as follows, provided that the chain is recurrent.
It is easy to see that this is true in case (a).


Case (a). If $\sum_{j=0}^{\infty} c_j < \infty$, then we may take $x_0 = 1$ to obtain a solution
satisfying $\sum_j x_j < \infty$. Hence the chain is positive-recurrent and the stationary
distribution is given by


    $w_j = \frac{c_j}{\sum_{k=0}^{\infty} c_k}, \quad j \ge 0.$


Case (b). If $\sum_{j=0}^{\infty} c_j = \infty$, then for any choice of $x_0$, either $\sum_j |x_j| = \infty$ or
$\sum_j |x_j| = 0$ by (8.6.16). Hence $w_j = 0$ for all $j \ge 0$, and the chain is
null-recurrent or transient.
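
To make the recipe (8.6.15) concrete, here is a sketch for one hypothetical birth-and-death chain, with constant birth probability $b = 1/4$ and death probability $d = 1/2$ (none at state 0). These numbers, and the names `c_coefficients`, `up`, `down` and `hold`, are assumptions chosen for the illustration and do not come from the text.

```python
from fractions import Fraction

def c_coefficients(up, down, J):
    """c_0..c_J from (8.6.15); up(j) = p_{j,j+1}, down(j) = p_{j,j-1}."""
    c = [Fraction(1)]
    for j in range(1, J + 1):
        c.append(c[-1] * up(j - 1) / down(j))
    return c

# Hypothetical chain: constant birth probability b, death probability d.
b, d = Fraction(1, 4), Fraction(1, 2)
up = lambda j: b
down = lambda j: d if j >= 1 else Fraction(0)
hold = lambda j: 1 - up(j) - down(j)          # p_jj, from (8.6.13)

J = 40
c = c_coefficients(up, down, J)

# Check the equations (8.6.14) with x_j = c_j (that is, x_0 = 1).
ok_zero = c[0] == c[0] * hold(0) + c[1] * down(1)
ok_rest = all(c[j] == c[j - 1] * up(j - 1) + c[j] * hold(j) + c[j + 1] * down(j + 1)
              for j in range(1, J))
print(ok_zero, ok_rest, float(sum(c)))
```

Here $c_j = (b/d)^j = 2^{-j}$, so $\sum_j c_j = 2 < \infty$ and we are in case (a): the stationary distribution is the geometric one, $w_j = 2^{-(j+1)}$.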

The preceding example may be modified to reduce the state space to a
finite set by letting $p_{c,c+1} = 0$ for some $c \ge 1$. A specific case of this is Example
8 of §8.3 which will now be examined.


Example 16. Let us find the stationary distribution for the Ehrenfest model.
We can proceed exactly as in Example 15, leading to the formula (8.6.15) but
this time it stops at $j = c$. Substituting the numerical values from (8.3.16),
we obtain

    $c_j = \frac{c(c-1) \cdots (c-j+1)}{1 \cdot 2 \cdots j} = \binom{c}{j}, \quad 0 \le j \le c.$


We have $\sum_{j=0}^{c} c_j = 2^c$ from (3.3.7); hence


    $w_j = \frac{1}{2^c} \binom{c}{j}, \quad 0 \le j \le c.$


This is just the binomial distribution $B\left(c; \frac{1}{2}\right)$.


Thus the steady state in Ehrenfest’s urn may be simulated by coloring the 
c balls red or black with probability 1/2 each, and independently of one 
another; or again by picking them at random from an infinite reservoir of red 
and black balls in equal proportions. 

Next, recalling (8.6.3) and (8.6.7), we see that the mean recurrence times
are given by

    $m_{jj} = 2^c \binom{c}{j}^{-1}, \quad 0 \le j \le c.$


For the extreme cases $j = 0$ (no black ball) and $j = c$ (no red ball) this is
equal to $2^c$ which is enormous even for $c = 100$. It follows (see Exercise 42)
that the expected time for a complete reversal of the composition of the urn
is very long indeed. On the other hand, the chain is recurrent; hence starting
e.g. from an urn containing all black balls, it is almost certain that they will
eventually be all replaced by red balls at some time in the Ehrenfest process,
and vice versa. Since the number of black balls can change only one at a time the
composition of the urn must go through all intermediate "phases" again and
again. The model was originally conceived to demonstrate the reversibility of
physical processes, but with enormously long cycles for reversal. "If we wait
long enough, we shall grow younger again!"
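
The binomial stationary distribution and the size of the mean recurrence times can be checked directly. The sketch below assumes that (8.3.16) reads $p_{j,j+1} = (c - j)/c$ and $p_{j,j-1} = j/c$, which is what the formula for $c_j$ above requires; the function name `ehrenfest` and the choice $c = 10$ are illustrative only.

```python
from fractions import Fraction
from math import comb

def ehrenfest(c):
    """Ehrenfest chain on 0..c, assuming p_{j,j+1} = (c-j)/c and p_{j,j-1} = j/c."""
    P = [[Fraction(0)] * (c + 1) for _ in range(c + 1)]
    for j in range(c + 1):
        if j < c:
            P[j][j + 1] = Fraction(c - j, c)
        if j > 0:
            P[j][j - 1] = Fraction(j, c)
    return P

c = 10
P = ehrenfest(c)
w = [Fraction(comb(c, j), 2 ** c) for j in range(c + 1)]      # B(c; 1/2)
wP = [sum(w[i] * P[i][j] for i in range(c + 1)) for j in range(c + 1)]
m = [1 / wj for wj in w]                                      # m_jj = 1 / w_j
print(wP == w, m[0], float(m[c // 2]))
print(2 ** 100)          # m_00 for c = 100, in time units
```

Even for the modest $c = 10$ the extreme states recur on average only once in $2^{10}$ steps, against roughly four steps for the balanced state $j = 5$.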

Finally, let us describe without proof a further possible decomposition of 
a recurrent class. The simplest illustration is that of the classical random walk. 
In this case the state space of all integers may be divided into two subclasses: 
the even integers and the odd integers. At one step the particle must move 
from one subclass to the other, so that the alternation of the two subclasses is 
a deterministic part of the transition. In general, for each recurrent class C 
containing at least two states, there exists a unique positive integer d, called 
the period of the class, with the following properties: 


(a) for every $i \in C$, $p_{ii}^{(n)} = 0$ if $d \nmid n$†; on the other hand, $p_{ii}^{(nd)} > 0$ for all
    sufficiently large $n$ (how large depending on $i$);
(b) for every $i \in C$ and $j \in C$, there exists an integer $r$, $1 \le r \le d$, such
    that $p_{ij}^{(n)} = 0$ if $d \nmid n - r$; on the other hand, $p_{ij}^{(nd+r)} > 0$ for all
    sufficiently large $n$ (how large depending on $i$ and $j$).


† "$d \nmid n$" reads "$d$ does not divide $n$".


Fixing the state $i$, we denote by $C_r$ the set of all states $j$ associated with the
same number $r$ in (b), for $1 \le r \le d$. These are disjoint sets whose union is
$C$. Then we have the deterministic cyclic transition:


    $C_1 \to C_2 \to \cdots \to C_d \to C_1.$


Here is the diagram of such an example with $d = 4$:


Figure 35


where the transition probabilities between the states are indicated by the
numbers attached to the directed lines joining them.

The period $d$ of $C$ can be found as follows. Take any $i \in C$ and consider
the set of all $n \ge 1$ such that $p_{ii}^{(n)} > 0$. Among the common divisors of this
set there is a greatest one: this is equal to $d$. The fact that this number is the
same for all choices of $i$ is part of the property of the period. Incidentally, the
decomposition described above holds for any class which is stochastically
closed (see §8.7 for definition); thus the free random walk has period 2
whether it is recurrent or transient.
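
For a finite chain the recipe for $d$ can be tried out mechanically. The sketch below is an illustration, not a general algorithm from the text: the function name `period` is hypothetical, and the cutoff $N$ (defaulting to three times the number of states) is an assumption that is ample for the small examples shown.

```python
from math import gcd

def period(P, i, N=None):
    """gcd of all n in 1..N with p_ii^(n) > 0, for a finite matrix P."""
    l = len(P)
    if N is None:
        N = 3 * l                   # assumed cutoff, enough for these examples
    reach = {i}                     # states reachable from i in exactly n steps
    d = 0
    for n in range(1, N + 1):
        reach = {k for j in reach for k in range(l) if P[j][k] > 0}
        if i in reach:              # equivalent to p_ii^(n) > 0
            d = gcd(d, n)
    return d

def circle(l):
    """Symmetric random walk on a circle with l states."""
    P = [[0.0] * l for _ in range(l)]
    for i in range(l):
        P[i][(i + 1) % l] += 0.5
        P[i][(i - 1) % l] += 0.5
    return P

cycle4 = [[1.0 if j == (i + 1) % 4 else 0.0 for j in range(4)] for i in range(4)]
print(period(cycle4, 0), period(circle(4), 0), period(circle(5), 0))
```

The deterministic four-cycle has period 4, the walk on a circle with an even number of states has period 2, and with an odd number of states the walk is aperiodic because returns in 2 steps and in 5 steps are both possible.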

When $d = 1$ the class is said to be aperiodic. A sufficient condition for
this is: there exists an integer $m$ such that all elements in $\Pi^m$ are positive.
For then it follows from the Chapman-Kolmogorov equations that the same
is true of $\Pi^n$ for all $n \ge m$, and so property (a) above implies that $d = 1$.
In this case the fundamental limit theorem given in (8.6.4) can be sharpened
as follows:


(8.6.17)    $\lim_{n \to \infty} p_{ij}^{(n)} = \frac{1}{m_{jj}};$


namely the limit of averages may be replaced by a strict individual limit. In
general if the period is $d$, and $i$ and $j$ are as in (b) above, then


(8.6.18)    $\lim_{n \to \infty} p_{ij}^{(nd+r)} = \frac{d}{m_{jj}}.$


We leave it to the reader to show: granted that the limit above exists, its value
must be that shown there as a consequence of (8.6.4). Actually (8.6.18) follows
easily from the particular case (8.6.17) if we consider $d$ steps at a time in the
transition of the chain, so that it stays in a fixed subclass. The sharp result
above was first proved by Markov who considered only a finite state space,
and was extended by Kolmogorov in 1936 to the infinite case. Several different
proofs are now known; see [Chung 2; §I.6] for one of them.


8.7. Winding up (or down?) 


In this section we shall give some idea of the general behavior of a homogeneous
Markov chain when there are both recurrent and transient states.
Let $R$ denote the set of all recurrent states, $T$ the set of all transient states, so
that $I = R \cup T$. We begin with a useful definition: a set of states will be called
[stochastically] closed iff starting from any state in the set the particle will
remain forever in the set. Here and hereafter we shall omit the tedious repetition
of the phrase "almost surely" when it is clearly indicated. The salient
features of the global motion of the particle may be summarized as follows.


(i) A recurrent class is closed. Hence, once the particle enters such a class
    it will stay there forever.
(ii) A finite set of transient states is not closed. In fact, starting from such
    a set the particle will eventually move out and stay out of it.
(iii) If $T$ is finite then the particle will eventually enter into the various
    recurrent classes.
(iv) In general the particle will be absorbed into the recurrent classes with
    total probability $\alpha$, and remain forever in $T$ with probability $1 - \alpha$,
    where $0 \le \alpha \le 1$.


Let us prove assertion (i). The particle cannot go from a recurrent state
to any transient state by Theorem 6; and it cannot go to any recurrent state
in a different class because two states from different classes do not communicate
by definition, hence one does not lead to the other by Theorem 8 if these
states are recurrent. Therefore from a recurrent class the particle can only
move within the class. Next, the truth of assertion (ii) is contained in the
proof of Theorem 11, according to which the particle can only spend a finite
number of time units in a finite set of transient states. Hence from a certain
instant on it will be out of the set. Assertion (iii) is a consequence of (ii) and is
illustrated by Example 3 of §8.3 (gambler's ruin problem). Assertion (iv) states
an obvious alternative on account of (i), and is illustrated by Example 1 of
§8.3 with $p > 1/2$, in which case $\alpha = 0$; or by Example 9. In the latter case
it is clear that starting from $i \ge 1$, either the particle will be absorbed in the
state 0 with probability $f_{i0}^*$ in the notation of (8.4.6); or it will move steadily
through the infinite set of transient states $\{i + 1, i + 2, \ldots\}$ with probability
$1 - f_{i0}^*$.

Let us further illustrate some of the possibilities by a simple numerical 
example. 


Example 17. Let the transition matrix be as follows: 


123 45 6 
13 lil |! 
Tle g aig 07° © 
Loo: did 
2)0 5 0:0 515 0 
; va 
|PRosLog 
(8.7.1) Seeenees ee EOE 


iN 
bee Re! 
© Nol = 


The state space may be finite or infinite according to the specification of $R_2$,
which may be the transition matrix of any recurrent Markov chain such as
Example 4 or 8 of §8.3, or Example 1 there with $p = 1/2$.

Here $T = \{1, 2, 3\}$, $R_1 = \{4, 5\}$ and $R_2$ are two distinct recurrent classes.
The theory of communication between states implies that the four blocks of
0's in the matrix will be preserved when it is raised to any power. Try to confirm
this fact by a few actual schematic multiplications. On the other hand,
some of the single 0's will turn positive in the process of multiplication. There
are actually two distinct transient classes: $\{1, 3\}$ and $\{2\}$; it is possible to go
from the first to the second but not vice versa. [This is not important; in fact,
a transient class which is not closed is not a very useful entity. It was defined
to be a class in §8.4 only by the force of circumstance!] All three transient
states lead to both $R_1$ and $R_2$, but it would be easy to add another which leads
to only one of them. The problem of finding the various absorption probabilities
can be solved by the general procedure below.

Let $i \in T$ and $C$ be a recurrent class. Put for $n \ge 1$:


(8.7.2)    $y_i^{(n)} = \sum_{j \in C} p_{ij}^{(n)} = P_i\{X_n \in C\}.$


This is the probability that the particle will be in $C$ at time $n$, given that it
starts from $i$. Since $C$ is closed it will then also be in $C$ at time $n + 1$; thus
$y_i^{(n)} \le y_i^{(n+1)}$ and so by the monotone sequence theorem in calculus the limit
exists as $n \to \infty$:


    $y_i = \lim_{n \to \infty} y_i^{(n)} = P_i\{X_n \in C \text{ for some } n \ge 1\},$


(why the second equation?) and gives the probability of absorption.


Theorem 14. The $\{y_i\}$ above satisfies the system of equations:


(8.7.3)    $x_i = \sum_{j \in T} p_{ij} x_j + \sum_{j \in C} p_{ij}, \quad i \in T.$


If $T$ is finite, it is the unique solution of this system. Hence it can be computed
by standard methods of linear algebra.


Proof: Let the particle start from $i$, and consider its state $j$ after one step.
If $j \in T$, then the Markov property shows that the conditional probability
of absorption becomes $y_j$; if $j \in C$, then it is already absorbed; if
$j \in (I - T) - C$, then it can never be absorbed in $C$. Taking into account these
possibilities, we get


    $y_i = \sum_{j \in T} p_{ij} y_j + \sum_{j \in C} p_{ij} \cdot 1 + \sum_{j \in (I - T) - C} p_{ij} \cdot 0.$


This proves the first assertion of the theorem. Suppose now $T$ is the finite set
$\{1, 2, \ldots, t\}$. The system (8.7.3) may be written in matrix form as follows:


(8.7.4)    $(\Delta_T - \Pi_T) x = y^{(1)},$


where $\Delta_T$ is the identity matrix indexed by $T \times T$; $\Pi_T$ is the restriction of $\Pi$
on $T \times T$, and $y^{(1)}$ is given in (8.7.2). According to a standard result in linear
algebra, the equation above has a unique solution if and only if the matrix
$\Delta_T - \Pi_T$ is nonsingular, namely it has an inverse $(\Delta_T - \Pi_T)^{-1}$, and then the
solution is given by


(8.7.5)    $x = (\Delta_T - \Pi_T)^{-1} y^{(1)}.$

Suppose the contrary, then the same result asserts that there is a nonzero
solution to the associated homogeneous equation. Namely there is a column
vector $v = (v_1, \ldots, v_t) \ne (0, \ldots, 0)$ satisfying


    $(\Delta_T - \Pi_T) v = 0$,  or  $v = \Pi_T v$.

It follows by iteration that


    $v = \Pi_T(\Pi_T v) = \Pi_T^2 v = \Pi_T^2(\Pi_T v) = \Pi_T^3 v = \cdots,$

and so for every $n \ge 1$:


    $v = \Pi_T^n v.$


[cf. (8.6.8) but observe the difference between right-hand and left-hand
multiplications.] This means


    $v_i = \sum_{j \in T} p_{ij}^{(n)} v_j, \quad i \in T.$


Letting $n \to \infty$ and using the Corollary to Theorem 5 we see that every term
in the sum converges to zero and so $v_i = 0$ for all $i \in T$, contrary to hypothesis.
This contradiction establishes the nonsingularity of $\Delta_T - \Pi_T$ and consequently
the existence of a unique solution given by (8.7.5). Since $\{y_i, i \in T\}$
is a solution the theorem is proved.

For Example 17 above, the equations in (8.7.3) for absorption probabilities
into $R_1$ are:


    $x_1 = \frac{1}{8} x_1 + \frac{3}{8} x_2 + \frac{1}{4} x_3 + \frac{1}{4},$
    $x_2 = \frac{1}{2} x_2 + \frac{1}{3},$
    $x_3 = \frac{1}{5} x_1 + \frac{3}{10} x_2 + \frac{2}{5}.$


We get $x_2$ at once from the second equation, and then $x_1$, $x_3$ from the others:


    $x_1 = \frac{26}{33}, \quad x_2 = \frac{2}{3}, \quad x_3 = \frac{25}{33}.$


For each $i$, the absorption probabilities into $R_1$ and $R_2$ add up to one, hence
those for $R_2$ are just $1 - x_1$, $1 - x_2$, $1 - x_3$. This is the unique solution to
another system of equations in which the constant terms above are replaced
by $0$, $\frac{1}{6}$, $\frac{1}{10}$. You may wish to verify this as it is a good habit to double-check
these things, at least once in a while.
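
In the spirit of double-checking, the following sketch verifies the solution exactly and also watches the probabilities $y_i^{(n)}$ of (8.7.2) increase to it. The coefficients are read off from the three equations above; the names `PI_T`, `y1` and `rhs` are hypothetical.

```python
from fractions import Fraction as F

# Transient block Pi_T (states 1, 2, 3) and the one-step absorption
# probabilities into R_1, read off from the three equations above.
PI_T = [[F(1, 8), F(3, 8),  F(1, 4)],
        [F(0),    F(1, 2),  F(0)],
        [F(1, 5), F(3, 10), F(0)]]
y1 = [F(1, 4), F(1, 3), F(2, 5)]

def rhs(x):
    """Right-hand side of (8.7.3): Pi_T x + y^(1)."""
    return [sum(PI_T[i][j] * x[j] for j in range(3)) + y1[i] for i in range(3)]

x = [F(26, 33), F(2, 3), F(25, 33)]
exact_ok = rhs(x) == x                      # x solves the system exactly

# y^(n+1) = Pi_T y^(n) + y^(1), starting from y^(1), in floating point.
y = [float(v) for v in y1]
for _ in range(200):
    y = [sum(float(PI_T[i][j]) * y[j] for j in range(3)) + float(y1[i])
         for i in range(3)]
print(exact_ok, y, [float(v) for v in x])

# Same check for R_2, with constant terms 0, 1/6, 1/10.
z1 = [F(0), F(1, 6), F(1, 10)]
z = [1 - v for v in x]
r2_ok = [sum(PI_T[i][j] * z[j] for j in range(3)) + z1[i] for i in range(3)] == z
print(r2_ok)
```

The iteration is just the first-step decomposition applied $n$ times; it converges quickly here because the transient block shrinks geometrically under repeated multiplication.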
It is instructive to remark that the problem of absorption into recurrent
classes can always be reduced to that of absorbing states. For each recurrent
class may be merged into a single absorbing state since we are not interested
in the transitions within the class; no state in the class leads outside, whereas
the probability of entering the class at one step from any transient state $i$ is
precisely the $y_i^{(1)}$ used above. Thus, the matrix in (8.7.1) may be converted
to the following one:


    $\begin{pmatrix}
    \frac{1}{8} & \frac{3}{8} & \frac{1}{4} & \frac{1}{4} & 0 \\
    0 & \frac{1}{2} & 0 & \frac{1}{3} & \frac{1}{6} \\
    \frac{1}{5} & \frac{3}{10} & 0 & \frac{2}{5} & \frac{1}{10} \\
    0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 0 & 1
    \end{pmatrix}$


in which the last two states $\{4\}$ and $\{5\}$ take the place of $R_1$ and $R_2$. The
absorption probabilities become just $f_{i4}^*$ and $f_{i5}^*$ in the notation of (8.4.6). The
two systems of equations remain of course the same.

When $T$ is finite and there are exactly two absorbing states there is another
interesting method. As before let $T = \{1, 2, \ldots, t\}$ and let the absorbing
states be denoted by $0$ and $t + 1$, so that $I = \{0, 1, \ldots, t + 1\}$. The method
depends on the discovery of a positive nonconstant solution of the equation
$(\Delta - \Pi)x = 0$, namely some such $v = (v_0, v_1, \ldots, v_{t+1})$ satisfying

(8.7.6)    $v_i = \sum_{j=0}^{t+1} p_{ij} v_j, \quad i = 0, 1, \ldots, t + 1.$

Observe that the two equations for $i = 0$ and $i = t + 1$ are automatically
true for any $v$, because $p_{0j} = \delta_{0j}$ and $p_{t+1,j} = \delta_{t+1,j}$; also that $v_i \equiv 1$ is always
a solution of the system, but it is constant. Now iteration yields


    $v_i = \sum_{j=0}^{t+1} p_{ij}^{(n)} v_j$


for all $n \ge 1$; letting $n \to \infty$ and observing that


    $\lim_{n \to \infty} p_{ij}^{(n)} = 0$  for $1 \le j \le t$;


    $\lim_{n \to \infty} p_{ij}^{(n)} = f_{ij}^*$  for $j = 0$ and $j = t + 1$;

we obtain

(8.7.7)    $v_i = f_{i0}^* v_0 + f_{i,t+1}^* v_{t+1}.$


Recall also that

(8.7.8)    $1 = f_{i0}^* + f_{i,t+1}^*.$


We claim that $v_0 \ne v_{t+1}$, otherwise it would follow from the last two equations
that $v_i = v_0$ for all $i$, contrary to the hypothesis that $v$ is nonconstant. Hence
we can solve these equations as follows:


(8.7.9)    $f_{i0}^* = \frac{v_i - v_{t+1}}{v_0 - v_{t+1}}, \quad f_{i,t+1}^* = \frac{v_0 - v_i}{v_0 - v_{t+1}}.$


Example 18. Let us return to Problem 1 of §8.1, where $t = c - 1$. If $p \ne q$,
then $v_i = (q/p)^i$ is a nonconstant solution of (8.7.6). This is trivial to verify
but you may well demand to know how on earth did we discover such a
solution? The answer in this case is easy (but motivated by knowledge of
difference equations used in §8.1): try a solution of the form $\lambda^i$ and see what
$\lambda$ must be. Now if we substitute this $v_i$ into (8.7.9) we get $f_{i0}^*$ equal to the $u_i$ in
(8.1.9).


If $p = q = \frac{1}{2}$, then $v_i = i$ is a nonconstant solution of (8.7.6) since

(8.7.10)    $i = \frac{1}{2}(i + 1) + \frac{1}{2}(i - 1).$


This leads to the same answer as given in (8.1.10). The new solution has to do
with the idea of a martingale (see Appendix 3). Here is another similar
example.


Example 19. The following model of random reproduction was introduced by
S. Wright in his genetical studies (see e.g. [Karlin] for further details). In
a haploid organism the genes occur singly rather than in pairs as in the diploid
case considered in §5.6. Suppose $2N$ genes of types $A$ and $a$ (the alleles) are
selected from each generation. The number of $A$-genes is the state of the
Markov chain and the transition probabilities are given below:
$I = \{0, 1, \ldots, 2N\}$, and


(8.7.11)    $p_{ij} = \binom{2N}{j} \left( \frac{i}{2N} \right)^j \left( 1 - \frac{i}{2N} \right)^{2N-j}.$


Thus if the number of $A$-genes in any generation is equal to $i$, then we may
suppose that there is an infinite pool of both types of genes in which the
proportion of $A$ to $a$ is as $i : 2N - i$, and $2N$ independent drawings are made
from it to give the genes of the next generation. We are therefore dealing
with $2N$ independent Bernoullian trials with success probability $i/2N$, which
results in the binomial distribution $B(2N; i/2N)$ in (8.7.11). It follows that
(see (4.4.16) or (6.3.6)) the expected number of $A$-genes is equal to


(8.7.12)    $\sum_{j=0}^{2N} p_{ij}\, j = 2N \cdot \frac{i}{2N} = i.$

This means that the expected number of $A$-genes in the next generation is
equal to the actual (but random) number of these genes in the present 
generation. In particular, this expected number remains constant through 
the successive generations. The situation is the same as in the case of a fair 
game discussed in §8.2 after (8.2.3). The uncertified trick used there is again 
applicable and in fact leads to exactly the same conclusion except for nota- 
tion. However, now we can also apply the proven formula (8.7.9) which gives 
at once 


$$f^*_{i0} = \frac{2N - i}{2N}, \qquad f^*_{i,2N} = \frac{i}{2N}.$$


These are the respective probabilities that the population will wind up being 
pure a-type or A-type. 
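The martingale-like identity (8.7.12) and the resulting absorption probabilities can be checked numerically for a small population. The sketch below is only an illustration: the choice 2N = 4, the starting state i = 1, the number of generations, and the function names are assumptions made here. It builds the binomial transition matrix (8.7.11) with exact fractions and pushes a point mass through many generations.

```python
from fractions import Fraction
from math import comb

def wright_fisher_matrix(two_n):
    """Transition matrix (8.7.11): row i is the binomial law B(2N; i/2N)."""
    rows = []
    for i in range(two_n + 1):
        s = Fraction(i, two_n)
        rows.append([comb(two_n, j) * s**j * (1 - s)**(two_n - j)
                     for j in range(two_n + 1)])
    return rows

def evolve(dist, matrix, steps):
    """Push a distribution over the states through `steps` generations."""
    size = len(dist)
    for _ in range(steps):
        dist = [sum(dist[i] * matrix[i][j] for i in range(size))
                for j in range(size)]
    return dist

two_n = 4                                   # illustrative population size
P = wright_fisher_matrix(two_n)
start = [Fraction(int(i == 1)) for i in range(two_n + 1)]   # one A-gene at first
late = evolve(start, P, 60)

print([float(x) for x in late])             # mass piles up at 0 and at 2N
print(sum(j * late[j] for j in range(two_n + 1)))   # expected number of A-genes
```

After many generations almost all the mass sits at the two absorbing states, in proportions close to $(2N - i)/2N$ and $i/2N$, while the expected number of A-genes stays exactly at its initial value.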

Our final example deals with a special but important kind of homogeneous 
Markov chain. Another specific example, a queuing process, is outlined with
copious hints in Exercises 29-31 below. 


Example 20. A subatomic particle may split into several particles after a 
nuclear reaction; a male child bearing the family name may have a number 
of male children or none. These processes may be repeated many times unless 
extinction occurs. These are examples of a branching process defined below. 


Figure 36 
There is no loss of generality to assume that at the beginning there is
exactly one particle: $X_0 = 1$. It gives rise to $X_1$ descendants of the first
generation, where

(8.7.13)  $$P(X_1 = j) = a_j, \qquad j = 0, 1, 2, \ldots.$$


Unless $X_1 = 0$, each of the particles in the first generation will give rise to
descendants of the second generation, whose number follows the same
probability distribution given in (8.7.13), and the actions of the various
particles are assumed to be stochastically independent. What is the distribution
of the number of particles of the second generation? Let the generating
function of $X_1$ be $g$:

$$g(z) = \sum_{j=0}^{\infty} a_j z^j.$$


Suppose the number of particles in the first generation is equal to $j$, and we
denote the numbers of their descendants by $Z_1, \ldots, Z_j$ respectively. Then by
hypothesis these are independent random variables each having $g$ as its
generating function. The total number of particles in the second generation is
$X_2 = Z_1 + \cdots + Z_j$ and this has the generating function $g^j$ by Theorem 6
of §6.5. Recalling (6.5.9) and the definition of conditional expectation in
(5.2.11), this may be written as follows:

(8.7.14)  $$E(z^{X_2} \mid X_1 = j) = g(z)^j,$$

and consequently by (5.2.12):

$$E(z^{X_2}) = \sum_{j=0}^{\infty} P(X_1 = j)\, E(z^{X_2} \mid X_1 = j) = \sum_{j=0}^{\infty} a_j g(z)^j = g(g(z)).$$


Let $g_n$ be the generating function of $X_n$ so that $g_1 = g$; then the above says
$g_2 = g(g_1)$. Exactly the same argument gives $g_n = g(g_{n-1}) = g \circ g \circ \cdots \circ g$
(there are $n$ appearances of $g$), where "$\circ$" denotes the composition of functions.
In other words $g_n$ is just the $n$-fold composition of $g$ with itself. Using
this new definition of $g_n$, we record this as follows:

(8.7.15)  $$g_n(z) = E(z^{X_n}) = \sum_{k=0}^{\infty} P(X_n = k)\, z^k.$$


Since the distribution of the number of descendants in each succeeding
generation is determined solely by the number in the existing generation,
regardless of past evolution, it is clear that the sequence $\{X_n, n \ge 0\}$ has the
Markov property. It is a homogeneous Markov chain because the law of
reproduction is the same from generation to generation. In fact, it follows
from (8.7.14) that the transition probabilities are given below:

(8.7.16)  $p_{jk}$ = coefficient of $z^k$ in the power series for $g(z)^j$.
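When only finitely many $a_j$ are positive, $g$ is a polynomial and both (8.7.15) and (8.7.16) can be computed mechanically. The following sketch does so with exact fractions; the offspring law $a_0 = 1/4$, $a_1 = 1/4$, $a_2 = 1/2$ and the helper names are assumptions chosen purely for illustration.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def poly_pow(a, j):
    """Coefficients of a(z)**j; by (8.7.16) entry k is p_jk when a is the offspring law."""
    out = [Fraction(1)]
    for _ in range(j):
        out = poly_mul(out, a)
    return out

def compose(outer, inner):
    """Coefficients of outer(inner(z)), evaluated by Horner's rule."""
    out = [Fraction(outer[-1])]
    for coef in reversed(outer[:-1]):
        out = poly_mul(out, inner)
        out[0] += coef
    return out

# Illustrative offspring law (an assumption, not from the text).
a = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

g2 = compose(a, a)      # distribution of X_2, as in (8.7.15) with n = 2
row2 = poly_pow(a, 2)   # transition probabilities p_{2k}, as in (8.7.16)
print(g2)
print(row2)
```

Here `g2[k]` is $P(X_2 = k)$ when $X_0 = 1$, and `row2[k]` is the probability of going from two particles to $k$ particles in one generation.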

To exclude trivial cases, let us now suppose that 

(8.7.17)  $$0 < a_0 < a_0 + a_1 < 1.$$


The state space is then (why?) the set of all nonnegative integers. The preceding
hypothesis implies that all states lead to 0 (why?) which is an absorbing
state. Hence all states except 0 are transient but there are infinitely many of
them. The general behavior under (iii) at the beginning of the section does
not apply and only (iv) is our guide. [Observe that the term "particle" was
used in a different context there.] Indeed, we will now determine the value
of $\alpha$ which is called the probability of extinction in the present model.

Putting $z = 0$ in (8.7.15) we see that $g_n(0) = p^{(n)}_{10}$; on the other hand our
general discussion about absorption tells us that

(8.7.18)  $$\alpha = \lim_{n \to \infty} p^{(n)}_{10} = \lim_{n \to \infty} g_n(0).$$

Since $g_n(0) = g(g_{n-1}(0))$, by letting $n \to \infty$ we obtain

(8.7.19)  $$\alpha = g(\alpha).$$


Thus the desired probability is a root of the equation $\varphi(z) = 0$ where $\varphi(z) =
g(z) - z$; we shall call it simply a root of $\varphi$. Since $g(1) = 1$, one root is $z = 1$.
Next we have

$$\varphi''(z) = g''(z) = \sum_{j=2}^{\infty} j(j-1)\, a_j z^{j-2} > 0$$

for $z > 0$, on account of (8.7.17). Hence the derivative $\varphi'$ is an increasing
function. Now recall Rolle's theorem from calculus: between two roots of a
differentiable function there is at least one root of its derivative. It follows
that $\varphi$ cannot have more than two roots in $[0, 1]$, for then $\varphi'$ would have more
than one root which is impossible because $\varphi'$ increases. Thus $\varphi$ can have at
most one root different from 1 in $[0, 1]$, and we have two cases to consider.


Case 1. $\varphi$ has no root in $[0, 1)$. Then since $\varphi(0) = a_0 > 0$, we must have
$\varphi(z) > 0$ for all $z$ in $[0, 1)$, for a continuous function cannot take both
positive and negative values in an interval without vanishing somewhere.
Thus we have

$$\varphi(1) - \varphi(z) < \varphi(1) = 0, \qquad 0 \le z < 1;$$

and it follows that

$$\varphi'(1) = \lim_{z \uparrow 1} \frac{\varphi(1) - \varphi(z)}{1 - z} \le 0;$$

hence $g'(1) \le 1$.

† It is customary to draw two pictures to show the two cases below. The reader is invited to
do this and see if he is more readily convinced than the author.


Case 2. $\varphi$ has a unique root $r$ in $[0, 1)$. Then by Rolle's theorem $\varphi'$ must
have a root $s$ in $[r, 1)$, i.e., $\varphi'(s) = g'(s) - 1 = 0$, and since $g'$ is an increasing
function we have

$$g'(1) > g'(s) = 1.$$

To sum up: the equation $g(z) = z$ has a positive root less than 1 if and
only if $g'(1) > 1$.
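To make the dichotomy concrete, here are two small worked instances; the particular coefficients are chosen here for convenience and do not come from the text. Both satisfy (8.7.17).

$$g(z) = \tfrac14 + \tfrac14 z + \tfrac12 z^2: \qquad g'(1) = \tfrac14 + 1 = \tfrac54 > 1, \qquad \varphi(z) = \tfrac12 z^2 - \tfrac34 z + \tfrac14 = \tfrac12 \left(z - \tfrac12\right)(z - 1),$$

so $\varphi$ has the root $r = \tfrac12$ in $[0, 1)$, which is Case 2.

$$g(z) = \tfrac12 + \tfrac14 z + \tfrac14 z^2: \qquad g'(1) = \tfrac14 + \tfrac12 = \tfrac34 \le 1, \qquad \varphi(z) = \tfrac14 z^2 - \tfrac34 z + \tfrac12 = \tfrac14 (z - 1)(z - 2),$$

so the only roots are 1 and 2, there is none in $[0, 1)$, and we are in Case 1.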

In Case 1, we must have $\alpha = 1$ since $0 \le \alpha \le 1$ and $\alpha$ is a root by (8.7.19).
Thus the population is almost certain to become extinct.

In Case 2, we will show that $\alpha$ is the root $r < 1$. For $g(0) < g(r) = r$;
and supposing for the sake of induction $g_{n-1}(0) < r$, then $g_n(0) = g(g_{n-1}(0)) <
g(r) = r$ because $g$ is an increasing function. Thus $g_n(0) < r$ for all $n$ and so
$\alpha \le r$ by (8.7.18). But then $\alpha$ must be equal to $r$ because both of them are
roots of the equation in $[0, 1)$.
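The limit in (8.7.18) also suggests a simple numerical recipe: start from 0 and apply $g$ repeatedly. The sketch below does this in floating point. The two offspring laws, the iteration count and the function name are illustrative assumptions; the first law has $g'(1) = 5/4$ and roots 1/2 and 1 of $g(z) = z$, the second has $g'(1) = 3/4$ and no root in $[0, 1)$.

```python
def extinction_probability(a, n_iter=500):
    """Iterate x <- g(x) from x = 0, where g(z) = sum of a[j] * z**j.

    The successive values are g_1(0), g_2(0), ..., which increase to the
    extinction probability by (8.7.18).
    """
    x = 0.0
    trajectory = []
    for _ in range(n_iter):
        x = sum(aj * x**j for j, aj in enumerate(a))
        trajectory.append(x)
    return x, trajectory

# Illustrative offspring laws (assumptions, not from the text).
alpha_super, traj_super = extinction_probability([0.25, 0.25, 0.5])   # g'(1) = 5/4
alpha_sub, traj_sub = extinction_probability([0.5, 0.25, 0.25])       # g'(1) = 3/4

print(alpha_super)   # close to the root r = 1/2 (Case 2)
print(alpha_sub)     # close to 1 (Case 1)
```

The iteration never overshoots the smallest root, which mirrors the induction step $g_n(0) < r$ used in Case 2.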

What will happen in Case 2 if the population escapes extinction? According
to the general behavior under (iv) it must then remain forever in the
transient states $\{1, 2, \ldots\}$ with probability $1 - \alpha$. Can its size sway back
and forth from small to big and vice versa indefinitely? This question is
answered by the general behavior under (ii), according to which it must stay
out of every finite set $\{1, 2, \ldots, \ell\}$ eventually, no matter how large $\ell$ is.
Therefore it must in fact become infinite (not necessarily monotonically, but
as a limit), namely:

$$P\{\lim_{n \to \infty} X_n = +\infty \mid X_n \neq 0 \text{ for all } n\} = 1.$$


The conclusion is thus a “boom or bust”’ syndrome. The same is true of 
the gambler who has an advantage over an infinitely rich opponent (see §8.2): 
if he is not ruined he will also become infinitely rich. Probability theory 
contains a lot of such extreme results some of which are known as Zero-or-one 
(“all or nothing’’) laws. 

In the present case there is some easy evidence for the conclusions
reached above. Let us compute the expectation of the population of the nth
generation. Let $\mu = E(X_1)$ be the expected number of descendants of each
particle. Observe that $\mu = g'(1)$ so that we have $\mu \le 1$ in Case 1 and $\mu > 1$
in Case 2. Suppose $\mu < \infty$; then if the number of particles in the $(n-1)$st
generation is $j$, the expected number of particles in the nth generation will be
$j\mu$ (why?). Using conditional expectation, this may be written as

$$E\{X_n \mid X_{n-1} = j\} = \mu j.$$

It follows from (5.2.12) that

$$E(X_n) = \sum_{j=0}^{\infty} \mu j\, P(X_{n-1} = j) = \mu E(X_{n-1})$$

and consequently by iteration

$$E(X_n) = \mu^n E(X_0) = \mu^n.$$


Therefore we have

$$\lim_{n \to \infty} E(X_n) = \lim_{n \to \infty} \mu^n = \begin{cases} 0, & \text{if } \mu < 1, \\ 1, & \text{if } \mu = 1, \\ \infty, & \text{if } \mu > 1. \end{cases}$$

This tends to support our conclusion in Case 1 for certain extinction; in fact
it is intuitively obvious that if $\mu < 1$ the population fails to be self-replacing
on the basis of averages. The case $\mu = 1$ may be disposed of with a bit more
insight, but let us observe that here we have the strange situation that
$E(X_n) = 1$ for all $n$, but $P(\lim_{n \to \infty} X_n = 0) = 1$ by Case 1. In case $\mu > 1$ the
crude interpretation would be that the population will certainly become
infinite. But we have proved under Case 2 that there is a definite probability
that it will die out as a dire contrast. This too is interesting in relating simple
calculations to more sophisticated theory. These comments are offered at
the closing of this book as an invitation to the reader for further wonderment
about probability and its meaning.
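The "strange situation" in the critical case $\mu = 1$ can be watched numerically. In the sketch below the offspring law $a_0 = 1/4$, $a_1 = 1/2$, $a_2 = 1/4$ is an assumption chosen only because its mean is exactly 1; the code tracks the survival probability $1 - g_n(0) = P(X_n \neq 0)$ over 1000 generations.

```python
def g(z, a):
    """Generating function g(z) = sum of a[j] * z**j."""
    return sum(aj * z**j for j, aj in enumerate(a))

# Illustrative critical offspring law (an assumption, not from the text).
a = [0.25, 0.5, 0.25]
mu = sum(j * aj for j, aj in enumerate(a))   # mean number of descendants

x = 0.0
survival = []                                # survival[n-1] = 1 - g_n(0)
for n in range(1, 1001):
    x = g(x, a)
    survival.append(1.0 - x)

print(mu)
print(survival[9], survival[99], survival[999])
```

The survival probability keeps shrinking toward 0, though only slowly, while $E(X_n) = \mu^n$ stays equal to 1: the few populations that survive must be large enough on average to compensate for the many that have died out.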


Exercises 


1. Let $X_n$ be as in (8.1.2) with $X_0 = 0$. Find the following probabilities:
(a) $P\{X_n > 0 \text{ for } n = 1, 2, 3, 4\}$,
(b) $P\{X_n \neq 0 \text{ for } n = 1, 2, 3, 4\}$,
(c) $P\{X_n < 2 \text{ for } n = 1, 2, 3, 4\}$,
(d) $P\{|X_n| < 2 \text{ for } n = 1, 2, 3, 4\}$.

2. Let $Y_n = X_{2n}$, where $X_n$ is as in No. 1. Show that $\{Y_n, n \ge 0\}$ is a
Markov chain and find its transition matrix. Similarly for $\{Z_n, n \ge 0\}$
where $Z_n = X_{2n+1}$; what is its initial distribution?

3. Let a coin be tossed indefinitely; let $H_n$ and $T_n$ denote respectively the
numbers of heads and tails obtained in the first $n$ tosses. Put $X_n = H_n$,
$Y_n = H_n - T_n$. Are these Markov chains? If so find the transition
matrix.

4.* As in No. 3 let $Z_n = |H_n - T_n|$. Is this a Markov chain? [Hint: compute
e.g. $P\{Y_{2n} = 2i \mid Z_{2n} = 2i\}$ by Bernoulli's formula, then
$P\{Z_{2n+1} = 2i \pm 1 \mid Z_{2n} = 2i, Y_{2n} > 0\}$.]


5. Let the transition matrix be given below:

(a) $\begin{pmatrix} \frac12 & \frac12 \\ \frac13 & \frac23 \end{pmatrix}$,  (b) $\begin{pmatrix} p_1 & q_1 & 0 \\ 0 & p_2 & q_2 \\ q_3 & 0 & p_3 \end{pmatrix}$.

Find $f^{(n)}_{11}$, $f^{(n)}_{12}$, $g^{(n)}_{12}$ for $n = 1, 2, 3$ (for notation see (8.4.6) and (8.4.10)).

6. In a model for the learning process of a rat devised by Estes, the rodent
is said to be in state 1 if it has learned a certain trick (to get a peanut
or avoid an electric shock), and to be in state 2 if it has not yet learned
it. Suppose that once it becomes learned it will remain so, while if it is
not yet learned it has a probability $\alpha$ of becoming so after each trial
run. Write down the transition matrix and compute $p^{(n)}_{21}$, $f^{(n)}_{21}$ for all
$n \ge 1$; and $m_{21}$ (see (8.6.3) for notation).

7. Convince yourself that it is a trivial matter to construct a transition
matrix in which there are any given number of transient and recurrent 
classes, each containing a given number of states, provided that either 
(a) J is infinite, or (b) J is finite but not all states are transient. 

8. Given any transition matrix $\Pi$, show that it is trivial to enlarge it by
adding new states which lead to old ones, but it is impossible to add any 
new state which communicates with any old one. 

9. In the "double or nothing" game, you bet all you have and you have a
fifty-fifty chance to double it or lose it. Suppose you begin with $1 and 
decide to play this game up to n times (you may have to quit sooner 
because you are broke). Describe the Markov chain involved with its 
transition matrix. 

10. Leo is talked into playing heads in a coin-tossing game in which the
probability of heads is only 0.48. He decides that he will quit as soon 
as he is one ahead. What is the probability that he may never quit? 

11. A man has two girl friends, one uptown and one downtown. When he
wants to visit one of them for a weekend he chooses the uptown girl 
with probability p. Between two visits he stays home for a weekend. 
Describe the Markov chain with three states for his weekend where- 
abouts: “uptown,” “home” and ‘‘downtown.” Find the long-run 
frequencies of each. [This is the simplest case of Example 4 of §8.3, 
but here is a nice puzzle related to the scheme. Suppose that the man 
decides to let chance make his choice by going to the bus stop where 
buses go both uptown and downtown and jumping aboard the first bus 
that comes. Since he knows that buses run in both directions every 
fifteen minutes, he figures that these equal frequencies must imply 
p = 1/2 above. But after a while he realizes that he has been visiting 
uptown twice as frequently as downtown. How can this happen? This 
example carries an important lesson to the practicing statistician, 
namely that the relevant datum may not be what appears at first sight.
Assume that the man arrives at the bus stop at random between 6 p.m.
and 8 p.m. Figure out the precise bus schedules which will make him 
board the uptown buses with probability p = 2/3.] 

12. Solve Problem 1 of §8.1 when there is a positive probability r of the
particle remaining in its position at each step. 

13. Solve (8.1.13) when $p \neq q$ as follows. First determine the two values
$\lambda_1$ and $\lambda_2$ such that $x_j = \lambda^j$ is a solution of $x_j = p x_{j+1} + q x_{j-1}$. The
general solution of this system is then given by $A\lambda_1^j + B\lambda_2^j$ where $A$
and $B$ are constants. Next find a particular solution of $x_j = p x_{j+1} +
q x_{j-1} + 1$ by trying $x_j = Cj$ and determine the constant $C$. The general
solution of the latter system is then given by $A\lambda_1^j + B\lambda_2^j + Cj$. Finally
determine $A$ and $B$ from the boundary conditions in (8.1.13).

14. The original Ehrenfest model is as follows. There are a total of N balls
in two urns. A ball is chosen at random from the 2N balls from either
urn and put into the other urn. Let $X_n$ denote the number of balls in a
fixed urn after n drawings. Show that this is a Markov chain having the 
transition probabilities given in (8.3.16) with c = 2N. 

15. A scheme similar to that in No. 14 was used by Daniel Bernoulli [son
of Johann, who was younger brother of Jakob] and Laplace to study 
the flow of incompressible liquids between two containers. There are 
N red and N black balls in two urns containing N balls each. A ball is 
chosen at random from each urn and put into the other. Find the 
transition probabilities for the number of red balls in a specified urn. 

16. In certain ventures such as doing homework problems one success
tends to reinforce the chance for another by imparting experience and
confidence; in other ventures the opposite may be true. Anyway let us
assume that the after effect is carried over only two consecutive trials so
that the resulting sequence of successes and failures constitutes a
Markov chain on two states $\{s, f\}$. Let

$$p_{ss} = \alpha, \qquad p_{ff} = \beta,$$

where $\alpha$ and $\beta$ are two arbitrary numbers between 0 and 1. Find the
long-run frequency of successes.

17. The following model has been used for the study of contagion. Suppose
that there are N persons some of whom are sick with influenza. The
following assumptions are made:
(a) when a sick person meets a healthy one, the chance is $\alpha$ that the
latter will be infected;
(b) all encounters are between two persons;
(c) all possible encounters in pairs are equally likely;
(d) one such encounter occurs in every (chosen) unit of time.
Define a Markov chain for the spread of the disease and write down its
transition matrix. [Are you overwhelmed by all these oversimplifying
assumptions? Applied mathematics is built upon the shrewd selection
and exploitation of such simplified models.]

18. The age of a light bulb is measured in days and fractions of a day do
not count. If a bulb is burned out during the day then it is replaced by
a new one at the beginning of the next day. Assume that a bulb which
is alive at the beginning of the day, possibly one which has just been
installed, has probability p of surviving at least one day so that its age
will be increased by one. Assume also that the successive bulbs used
lead independent lives. Let $X_0 = 0$ and $X_n$ denote the age of the bulb
which is being used at the beginning of the $(n+1)$st day. (We begin with
the first day, thus $X_1 = 1$ or 0 according as the initial bulb is still in
place or not at the beginning of the second day.) The process $\{X_n, n \ge 0\}$
is an example of a renewal process. Show that it is a recurrent Markov
chain, find its transition probabilities and stationary distribution. [Note:
the life span of a bulb being essentially a continuous variable, a lot of
words are needed to describe the scheme accurately in discrete time,
and certain ambiguities must be resolved by common sense. It would
be simpler and clearer to formulate the problem in terms of heads and
tails in coin-tossing (how?), but then it would have lost the flavor of
application!]

19. Find the stationary distribution for the random walk with two reflecting
barriers (Example 4 of §8.3). 

20. In a sociological study of "conformity" by B. Cohen, the following
Markov chain model was used. There are four states: $S_1$ = consistently
nonconforming, $S_2$ = indecisively nonconforming, $S_3$ = indecisively
conforming, $S_4$ = consistently conforming. In a group experiment
subjects were found to switch states after each session according to the
following transition matrix:

        S_1    S_2    S_3    S_4
S_1     1      0      0      0
S_2     .06    .76    .18    0
S_3     0      .27    .69    .04
S_4     0      0      0      1

Find the probabilities of ultimate conversion from the "conflict" states
$S_2$ and $S_3$ into the "resolution" states $S_1$ and $S_4$.

21. In a genetical model similar to Example 19 of §8.7, we have J =
{0,1,...,2N} and 


_ (7 oe — V/ (Cy): 
Pu\ GW Nj N 
How would you describe the change of genotypes from one generation 
to another by some urn scheme? Find the absorption probabilities. 
[Hint: compute $\sum_{j=0}^{2N} j\, p_{ij}$ by simplifying the binomial coefficients, or by
Theorem 1 of §6.1.]

22. For the branching process in Example 20 of §8.7, if $a_0$, $a_1$ and $a_2$ are
positive but the other $a_j$'s are all zero, find the probability of extinction.

23. Suppose that the particles in the first generation of a branching process
follow a probability law of splitting given by $\{b_j, j \ge 0\}$ which may be
different from that of the initial particle given by (8.7.13). What then is the
distribution of the number of particles in the second generation?

24. A sequence of electric impulses is measured by a meter which records
the highest voltage that has passed through it up to any given time.
Suppose that the impulses are uniformly distributed over the range
$\{1, 2, \ldots, \ell\}$. Define the associated Markov chain and find its transition
matrix. What is the expected time until the meter records the
maximum value $\ell$? [Hint: argue as in (8.1.13) for the expected absorption
time into the state $\ell$; use induction after computing $e_{\ell-1}$ and $e_{\ell-2}$.]

25. In proof-reading a manuscript each reader finds at least one error. But
if there are $j$ errors when he begins, he will leave it with any number
of errors between 0 and $j - 1$ with equal probabilities. Find the expected
number of readers needed to discover all the errors. [Hint:
$e_j = j^{-1}(e_1 + \cdots + e_{j-1}) + 1$; now simplify $e_j - e_{j-1}$.]

26. A deck of m cards may be shuffled in various ways. Let the state space
be the m! different orderings of the cards. Each particular mode of 
shuffling sends any state (ordering) into another. If the various modes 
are randomized this results in various transition probabilities between 
the states. Following my tip (a) in §3.4 for combinatorial problems, 
let us begin with m = 3 and the following two modes of shuffling: 

(i) move the top card to the bottom, with probability p; 

(ii) interchange the top and middle cards, with probability 1 — p. 
Write down the transition matrix. Show that it is doubly stochastic and 
all states communicate. Show that if either mode alone is used the states 
will not all communicate. 

27. Change the point of view in No. 26 by fixing our attention on a particular
card, say the queen of spades if the three cards are the king,
queen and knight of spades. Let $X_n$ denote its position after $n$ shufflings.
Show that this also constitutes a Markov chain with a doubly stochastic 
transition matrix. 

28. Now generalize Nos. 26 and 27: for any m and any randomized shuffling,
the transition matrices in both formulations are doubly stochastic.
[Hint: each mode of shuffling as a permutation on m cards has an
inverse. Thus if it sends the ordering $j$ into $k$ then it sends some ordering
$i$ into $j$. For fixed $j$ the correspondence $i = i(k)$ is one-to-one and
$p_{ij} = p_{jk}$. This proves the result for the general case of No. 26. Next
consider two orderings $j_1$ and $j_2$ with the fixed card in the topmost
position, say. Each mode of shuffling which sends $j_1$ into an ordering
with the given card second from the top does the same to $j_2$. Hence the
sum of probabilities of such modes is the same for $j_1$ or $j_2$, and gives
the transition probability $1 \to 2$ for the displacement of the card in
question.]

29. Customers arrive singly at a counter and enter into a queue if it is
occupied. As soon as one customer finishes, the service for the next
customer begins if there is anyone in the queue, or upon the arrival of 
the next customer if there is no queue. Assume that the service time is 
constant (e.g., a taped recording or automatic hand-dryer), then this 
constant may be taken as the unit of time. Assume that the arrivals 
follow a Poisson process with parameter $\alpha$ in this unit. For $n \ge 1$ let
$X_n$ denote the number of customers in the queue at the instant when
the nth customer finishes his service. Let $\{Y_n, n \ge 1\}$ be independent
random variables with the Poisson distribution $\pi(\alpha)$; see §7.1. Show
that

$$X_{n+1} = (X_n - 1)^+ + Y_n, \qquad n \ge 1;$$

where $x^+ = x$ if $x > 0$ and $x^+ = 0$ if $x \le 0$. Hence conclude that
$\{X_n, n \ge 1\}$ is a Markov chain on $\{0, 1, 2, \ldots\}$ with the following
transition matrix:

$$\begin{pmatrix} c_0 & c_1 & c_2 & c_3 & \cdots \\ c_0 & c_1 & c_2 & c_3 & \cdots \\ 0 & c_0 & c_1 & c_2 & \cdots \\ 0 & 0 & c_0 & c_1 & \cdots \\ \cdots & \cdots & \cdots & \cdots & \cdots \end{pmatrix}$$

where $c_j = \pi_j(\alpha)$. [Hint: this is called a queuing process and $\{X_n, n \ge 1\}$
is an imbedded Markov chain. At the time when the nth customer finishes
there are two possibilities. (i) The queue is not empty; then the $(n+1)$st
customer begins his service at once and during his (unit) service time $Y_n$
customers arrive. Hence when he finishes the number in the queue is
equal to $X_n - 1 + Y_n$. (ii) The queue is empty; then the counter is free
and the queue remains empty until the arrival of the $(n+1)$st customer.
He begins service at once and during his service time $Y_n$ customers
arrive. Hence when he finishes the number in the queue is equal to $Y_n$.
The $Y_n$'s are independent and have $\pi(\alpha)$ as distribution, by Theorems 1
and 2 of §7.2.]

30. Generalize the scheme in No. 29 as follows. The service time is a random
variable $S$ such that $P\{S = k\} = b_k$, $k \ge 1$. Successive service times
are independent and identically distributed. Show that the conclusions
of No. 29 hold with

$$c_j = \sum_{k=1}^{\infty} b_k\, \pi_j(k\alpha).$$

31. In No. 29 or No. 30, let $\mu = \sum_{j=0}^{\infty} j c_j$. Prove that the Markov chain is
transient, null-recurrent, or positive-recurrent according as $\mu > 1$,
$\mu = 1$ or $\mu < 1$. [This result is due to Lindley; here are the steps for a
proof within the scope of Chapter 8. In the notation of §8.4 let

$$F_{10}(z) = f(z), \qquad g(z) = \sum_{j=0}^{\infty} c_j z^j.$$

(a) $F_{j,j-1}(z) = f(z)$ for all $j \ge 1$; because e.g. $f^{(4)}_{j,j-1} = P\{Y_n \ge 1,\ Y_n + Y_{n+1} \ge 2,\ Y_n + Y_{n+1} + Y_{n+2} \ge 3,\ Y_n + Y_{n+1} + Y_{n+2} + Y_{n+3} = 3 \mid X_n = j\}$;

(b) $F_{j0}(z) = f(z)^j$ for $j \ge 1$, because the queue size can decrease only
by one at a step;

(c) $f^{(1)}_{10} = c_0$, $f^{(v)}_{10} = \sum_{j=1}^{\infty} c_j f^{(v-1)}_{j0}$ for $v \ge 2$; hence

$$f(z) = c_0 z + \sum_{j=1}^{\infty} c_j z F_{j0}(z) = z\, g(f(z));$$

(d) $F_{00}(z) = z\, g(f(z))$ by the same token;

(e) if $f(1) = \rho$, then $\rho$ is the smallest root of the equation $\rho = g(\rho)$
in $[0, 1]$; hence $F_{00}(1) = f(1) < 1$ or $= 1$ according as $g'(1) > 1$
or $\le 1$ by Example 20 of §8.7;

(f) $f'(1) = f'(1) g'(1) + g(1)$; hence if $g'(1) \le 1$ then in the notation
of (8.6.3), $m_{00} = F'_{00}(1) = f'(1) = \infty$ or $< \infty$ according as $g'(1) = 1$
or $< 1$. Q.E.D.

For more complicated queueing models see e.g. [Karlin].]

32. A company desires to operate s identical machines. These machines
are subject to failure according to a given probability law. To replace
these failed machines the company orders new machines at the beginning
of each week to make up the total s. It takes one week for each new order
to be delivered. Let $X_n$ be the number of machines in working order
at the beginning of the nth week and let $Y_n$ denote the number of
machines that fail during the nth week. Establish the recursive formula

$$X_{n+1} = s - Y_n,$$

and show that $\{X_n, n \ge 1\}$ constitutes a Markov chain. Suppose that the
failure law is uniform, i.e.:

$$P\{Y_n = j \mid X_n = i\} = \frac{1}{i + 1}, \qquad j = 0, 1, \ldots, i.$$

Find the transition matrix of the chain, its stationary distribution, and
the expected number of machines in operation in the steady state.

33. In No. 32 suppose the failure law is binomial:

$$P\{Y_n = j \mid X_n = i\} = \binom{i}{j} p^j (1 - p)^{i-j}, \qquad j = 0, 1, \ldots, i,$$

with some probability p. Answer the same questions as before. [These
two problems about machine replacement are due to D. Iglehart.]

34. The matrix $[p_{ij}]$, $i \in J$, $j \in J$ is called substochastic iff for every $i$ we
have $\sum_j p_{ij} \le 1$. Show that every power of such a matrix is also
substochastic.

35. Show that the set of states $C$ is stochastically closed if and only if for
every $i \in C$ we have $\sum_{j \in C} p_{ij} = 1$.


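36. Show that

$$\max_{0 < n < \infty} P_i\{X_n = j\} \le P_i\Big\{\bigcup_{n=1}^{\infty} [X_n = j]\Big\} \le \sum_{n=1}^{\infty} P_i\{X_n = j\}.$$

Hence deduce that $i \leadsto j$ if and only if $f^*_{ij} > 0$.

37. Prove that if $q_{ij} > 0$, then $\sum_{n=1}^{\infty} p^{(n)}_{ij} = \infty$.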


38. Prove that if $j \leadsto i$, then $g_{ij} < \infty$. Give an example where $g_{ij} = \infty$.
[Hint: show that $g^{(n)}_{ij} f^{(v)}_{ji} \le f^{(n+v)}_{ii}$ and choose $v$ so that $f^{(v)}_{ji} > 0$.]

39. Prove that if there exists $j$ such that $i \leadsto j$ but not $j \leadsto i$, then $i$ is transient.
[Hint: use Theorem 9; or argue as in the proof of Theorem 9 to get
$q_{ii} \le p^{(n)}_{ij} \cdot 0 + (1 - p^{(n)}_{ij}) \cdot 1$ for every $n$.]

40. Define for arbitrary $i$, $j$ and $k$ in $J$ and $n \ge 1$:

$${}_k p^{(n)}_{ij} = P_i\{X_v \neq k \text{ for } 1 \le v \le n - 1;\ X_n = j\}.$$

Show that if $k = j$ this reduces to $f^{(n)}_{ij}$, while if $k = i$ it reduces to $g^{(n)}_{ij}$.
In general, prove that

$$\sum_{\ell \neq k} {}_k p^{(n)}_{i\ell}\; {}_k p^{(m)}_{\ell j} = {}_k p^{(n+m)}_{ij}.$$

These are called taboo probabilities because the passage through $k$
during the transition is taboo.

41. If the total number of states is $r$, and $i \leadsto j$, then there exists $n$ such that
$1 \le n \le r$ and $p^{(n)}_{ij} > 0$. [Hint: any sequence of states leading from $i$
to $j$ in which some $k$ occurs twice can be shortened.]

42. Generalize the definition in (8.6.3) as follows:

$$m_{ij} = E_i(T_j) = \sum_{v=1}^{\infty} v f^{(v)}_{ij}.$$

Prove that $m_{ij} + m_{ji} \ge m_{ii}$ for any two states $i$ and $j$. In particular, in
Example 16 of §8.6, we have $m_{0c} \ge 2^{c-1}$.

43. Prove that the symmetric random walk is null-recurrent. [Hint: $p^{(n)}_{ij} =
P\{\xi_1 + \cdots + \xi_n = j - i\}$; use (7.3.7) and the estimate following it.]

44. For any state $i$ define the holding time in $i$ as follows: $S =
\max\{n \ge 1 \mid X_v = i \text{ for all } v = 1, 2, \ldots, n\}$. Find the distribution of $S$.

45. Given the Markov chain $\{X_n, n \ge 1\}$ in which there is no absorbing
state, define a new process as follows. Let $n_1$ be the smallest value of $n$
such that $X_n \neq X_1$, $n_2$ the smallest value $> n_1$ such that $X_n \neq X_{n_1}$, $n_3$
the smallest value $> n_2$ such that $X_n \neq X_{n_2}$ and so on. Now put $Y_v =
X_{n_v}$; show that $\{Y_v, v \ge 1\}$ is also a Markov chain and derive its
transition matrix from that of $\{X_n, n \ge 1\}$. Prove that if a state is
recurrent in one of them, then it is also recurrent in the other.

46. In the notation of No. 3, put $H^{(2)}_n = \sum_{v=1}^{n} H_v$. Show that $\{X_n\}$ does not
form a Markov chain but if we define a process whose value $Y_n$ at time
$n$ is given by the ordered pair of states $(X_{n-1}, X_n)$, then $\{Y_n, n \ge 1\}$ is
a Markov chain. What is its state space and transition matrix? The
process $\{H^{(2)}_n, n \ge 0\}$ is sometimes called a Markov chain of order 2.
How would you generalize this notion to a higher order?

47. There is a companion to the Markov property which shows it in reverse
time. Let $\{X_n\}$ be a homogeneous Markov chain. For $n \ge 1$ let $B$ be
any event determined by $X_{n+1}, X_{n+2}, \ldots$. Show that we have for any
two states $i$ and $j$:

$$P\{X_{n-1} = j \mid X_n = i;\ B\} = P\{X_{n-1} = j \mid X_n = i\};$$

but this probability may depend on $n$. However, if $\{X_n\}$ is stationary
as in Theorem 13, show that the probability above is equal to

$$\tilde{p}_{ij} = \frac{w_j\, p_{ji}}{w_i}$$

and so does not depend on $n$. Verify that $[\tilde{p}_{ij}]$ is a transition matrix.
A homogeneous Markov chain with this transition matrix is said to be a
reverse chain relative to the original one.


Appendix 3
Martingale

Let each $X_n$ be a random variable having a finite expectation, and for simplicity
we will suppose it to take integer values. Recall the definition of
conditional expectation from the end of §5.2. Suppose that for every event $A$
determined by $X_0, \ldots, X_{n-1}$ alone, and for each possible value $i$ of $X_n$, we
have

(A.3.1)  $$E\{X_{n+1} \mid A;\ X_n = i\} = i;$$

then the process $\{X_n, n \ge 0\}$ is called a martingale. This definition resembles
that of a Markov chain given in (8.3.1) in the form of the conditioning, but
the equation is a new kind of hypothesis. It is more suggestively exhibited
in the symbolic form below:

$$E\{X_{n+1} \mid X_0, X_1, \ldots, X_n\} = X_n.$$


This means: for arbitrary given values of $X_0, X_1, \ldots, X_n$, the conditional
expectation of $X_{n+1}$ is equal to the value of $X_n$, regardless of the other values.
The situation is illustrated by the symmetric random walk or the genetical
model in Example 19 of §8.7. In the former case, if the present position of
the particle is $X_n$, then its position after one step will be $X_n + 1$ or $X_n - 1$
with probability 1/2 each. Hence we have, whatever the value of $X_n$:

$$E\{X_{n+1} \mid X_n\} = \frac{1}{2}(X_n + 1) + \frac{1}{2}(X_n - 1) = X_n;$$


furthermore this relation remains true when we add to the conditioning the
previous positions of the particle represented by $X_0, X_1, \ldots, X_{n-1}$. Thus the
defining condition (A.3.1) for a martingale is satisfied. In terms of the gambler,
it means that if the game is fair then at each stage his expected gain or loss
cancel out so that his expected future worth is exactly equal to his present
assets. A similar assertion holds true for the number of A-genes in the
genetical model. More generally, when the condition (8.7.6) is satisfied, then
the process $\{v(X_n), n \ge 0\}$ constitutes a martingale, where $v$ is the function
$i \to v(i)$, $i \in J$. Finally in Example 20 of §8.7, it is easy to verify that the
normalized population size $\{X_n/\mu^n, n \ge 0\}$ is a martingale.

If we take $A$ to be an event with probability one in (A.3.1), and use
(5.2.12), we obtain

(A.3.2)  $$E(X_{n+1}) = \sum_i P(X_n = i)\, E(X_{n+1} \mid X_n = i) = \sum_i P(X_n = i)\, i = E(X_n).$$

Hence in a martingale all the random variables have the same expectation.
This is observed in (8.2.3), but the fact by itself is not significant. The following
result from the theory of martingales covers the applications mentioned
there and in §8.7. Recall the definition of an optional random variable from
§8.5.

Theorem. If the martingale is bounded, namely if there exists a constant
$M$ such that $|X_n| \le M$ for all $n$, then for any optional $T$ we have

(A.3.3)  $$E(X_T) = E(X_0).$$

In the case of Problem 1 of §8.1 with $p = 1/2$, we have $|X_n| \le c$; in the
case of Example 3 of §8.3 we have $|X_n| \le 2N$. Hence the theorem is applicable
and the absorption probabilities fall out from it as shown in §8.2.
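For the symmetric random walk with absorbing barriers at 0 and c, the content of (A.3.3) can be checked by brute force. The sketch below is an illustration only: the values c = 5, starting point 2, the step count and the function names are assumptions. It evolves the exact distribution of the walk stopped at the barriers and records the mean after each step.

```python
from fractions import Fraction

def step(dist):
    """One step of the symmetric random walk on 0..c, stopped at 0 and c."""
    c = len(dist) - 1
    new = [Fraction(0)] * (c + 1)
    new[0], new[c] = dist[0], dist[c]        # absorbed mass stays put
    for i in range(1, c):
        new[i - 1] += dist[i] / 2
        new[i + 1] += dist[i] / 2
    return new

def stopped_walk(c, start, steps):
    """Return the distribution after `steps` steps and the mean after each step."""
    dist = [Fraction(0)] * (c + 1)
    dist[start] = Fraction(1)
    means = []
    for _ in range(steps):
        dist = step(dist)
        means.append(sum(i * m for i, m in enumerate(dist)))
    return dist, means

# Illustrative parameters (assumptions, not from the text).
dist, means = stopped_walk(c=5, start=2, steps=200)

print(set(means))                        # the mean never moves from the start
print(float(dist[0]), float(dist[5]))    # close to 3/5 and 2/5
```

Since the mean stays at the starting point $i$ while the mass drains into 0 and c, the limiting mass at c must be $i/c$, which is the absorption probability obtained in §8.2.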

The extension of (A.3.2) to (A.3.3) may be false for a martingale and an
optional $T$, without some supplementary condition such as boundedness.
In this respect, the theorem above differs from the strong Markov property
discussed in §8.5. Here is a trivial but telling example for the failure of
(A.3.3). Let the particle start from 0 and let $T$ be the first entrance time into 1.
Then $T$ is finite by Theorem 2 of §8.2, hence $X_T$ is well defined and must
equal 1 by its definition. Thus $E(X_T) = 1$ but $E(X_0) = 0$.

Martingale theory was largely developed by J. L. Doob (1910- ) and 
has become an important chapter of modern probability theory; for an 
introduction see [Chung 1; Chapter 9]. 


ANSWERS TO PROBLEMS 


Chapter I 


$(A \cup B)(B \cup C) = ABC + ABC^c + A^cBC + A^cBC^c + AB^cC$; $A \setminus B = AB^cC + AB^cC^c$; {the set of ω which belongs to exactly one of the sets A, B, C} $= AB^cC^c + A^cBC^c + A^cB^cC$. 


. The dual is true. 

. Define $A \# B = A^c \cup B^c$, or $A^c \cap B^c$. 

. $I_{A \setminus B} = I_A - I_A I_B$; $I_{A - B} = I_A - I_B$. 

. $I_{A \cup B \cup C} = I_A + I_B + I_C - I_{AB} - I_{AC} - I_{BC} + I_{ABC}$. 


Chapter 2 


» PWC, + Cr) S P(r) + PCC). 

» POS) + Sx + S3 + Su) 2 POS1) + PCS2) + PCS3) + PCS)). 

. Take AB = @, P(A) > 0, P(B) > 0. 

. 17. 

. 126. 

. $|A \cup B \cup C| = |A| + |B| + |C| - |AB| - |AC| - |BC| + |ABC|$. 
. $P(A \,\Delta\, B) = P(A) + P(B) - 2P(AB) = 2P(A \cup B) - P(A) - P(B)$. 
. Equality holds when m and n are relatively prime. 


l 
n(n + 1)” > l. 
14 


° 60 
. If A is independent of itself, then P(A) = 0 or P(A) = 1; if A and B are 
disjoint and independent then P(A)P(B) = 0. 


. $p_1 p_2 q_3 p_4 q_5$ where $p_k$ = probability that the kth coin falls heads, 
$q_k = 1 - p_k$. The probability of exactly 3 heads for 5 coins is equal to 
$\sum p_{k_1} p_{k_2} p_{k_3} q_{k_4} q_{k_5}$, where the sum ranges over the 10 unordered triples 
$(k_1, k_2, k_3)$ of (1, 2, 3, 4, 5) and $(k_4, k_5)$ denotes the remaining unordered 
pair. 
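
The sum over the 10 unordered triples in the last answer can be carried out mechanically. The sketch below is an illustration only; the function name `prob_exactly_k_heads` and the sample probabilities are assumptions made for the example, not data from the problem.

```python
from fractions import Fraction
from itertools import combinations

def prob_exactly_k_heads(p, k):
    """Probability of exactly k heads when coin i falls heads with probability p[i].

    Sums, over every unordered choice of k coins, the product of p for the
    chosen coins and q = 1 - p for the remaining ones.
    """
    n = len(p)
    total = Fraction(0)
    for heads in combinations(range(n), k):
        term = Fraction(1)
        for i in range(n):
            term *= p[i] if i in heads else 1 - p[i]
        total += term
    return total

# Hypothetical head probabilities for five different coins.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 4), Fraction(1, 5), Fraction(1, 6)]
print(prob_exactly_k_heads(p, 3))

# With five fair coins the sum reduces to C(5,3)/2^5.
fair = [Fraction(1, 2)] * 5
print(prob_exactly_k_heads(fair, 3))
```

For fair coins all 10 terms are equal, which recovers the usual binomial value; for unequal coins the terms differ and must be added one by one as above.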


Chapter 3 


~34+2;34+2+4+6 X 2). 
#777 TY 
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3. Three shirts are delivered in two different packages each of which may 
contain 0 to 3. If the shirts are distinguishable: $2^3$; if not: $\binom{2+3-1}{3}$. 


-3X4X3KSK353K4XKX 34+) XK C+ 1) XS. 
. 26? + 26%; 100. 


» 4's; 2 & 414. 


20 
, (; ); (20);. 
10. 35 (0 sum being excluded); 23. 


It. ; if the missing ones are as likely to be of the same size as of different 


sizes; ; if each missing one is equally likely to be of any of the sizes. 


244 2x4! ; 4. 

12. 3361 OF er according as the two keys are tried in one or both orders 
(how is the lost key counted ?). 

13. 20/216 (by enumeration); some interpreted “steadily increasing” to mean 
“forming an arithmetical progression,” if you know what that means. 


14. (a) $1/6^3$; (b) $\{6 \times 1 + 90 \times 3 + 120 \times 6\}/6^6$. 


15. ({) a1; (3) (3) 3! 
16. 1 1(6) (4) +) (3) +) @) +. G)) + (@) (0) }/ Ga) 


17. From an outside lane: 3/8; from an inside lane: 11/16. 


18. (m1), (m — Dn 
0 


m (M)n 


19. 666 


6 
18 14 18 

20. 4/(15)s (11) / (is). 

21. Assuming that neither pile is empty: (a) both distinguishable: $2^{10} - 2$; 
(b) books distinguishable but piles not: $(2^{10} - 2)/2$; (c) piles distinguishable 
but books not: 9; (d) both indistinguishable: 5. 

10! 10! 4! lO: |b! 
" 3rgiaiat’ 31giaiai ~* Dror? 313taiai * Dra 


23. (a) ({5) (75) (75 )180)1/(366 i, 
(b) (305)so/(366)eo 


(10) / (so) 
(" 3 )Cr)/ (100) 


2 


> 


25. 


MN 
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27. 


29. 


30. 


OC NN 


Divide the following numbers by $\binom{52}{5}$: 
(a) $4 \times \binom{13}{5}$; (b) $9 \times 4^5$; (c) $4 \times 9$; 
(d) $13 \times 48$; (e) $13 \times 12 \times 4 \times 6$. 
Divide the following numbers by 6°: 


6! 6! 5\ 6! 
oe a Sa 2% G) Xa 


6 6! 6 6! 


6 6!  /6\ (4 6! (6 6! 
(5) x 212!12!° ( (*) x F191? (‘) (5) xX AP 6! 


Add these up for a check; use your calculator if you have one. 

Do the problem first for n = 2 by enumeration to see the situation. In 
general, suppose that the right pocket is found empty and there are k 
matches remaining in the left pocket. For $0 \le k \le n$ the probability of 
this event is equal to $\frac{1}{2^{2n-k}} \binom{2n-k}{n} \frac{1}{2}$. This must be multiplied by 2 
because right and left may be interchanged. A cute corollary to the solution 
is the formula below: 


$$\sum_{k=0}^{n} \frac{1}{2^{2n-k}} \binom{2n-k}{n} = 1.$$
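
The corollary to the match problem, namely that the terms $\binom{2n-k}{n}/2^{2n-k}$ for k = 0, ..., n add up to 1, is easy to confirm in exact arithmetic for small n. This is a minimal sketch; the function name `banach_total` is hypothetical.

```python
from fractions import Fraction
from math import comb

def banach_total(n):
    """Sum over k = 0..n of C(2n - k, n) / 2^(2n - k), in exact arithmetic."""
    return sum(Fraction(comb(2 * n - k, n), 2 ** (2 * n - k)) for k in range(n + 1))

for n in range(1, 8):
    print(n, banach_total(n))
```

For n = 2 the three terms are 6/16, 3/8 and 1/4, which is the enumeration suggested at the start of the answer.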


Chapter 4 


. $P\{X + Y = k\} = 1/3$ for k = 3, 4, 5; same for Y + Z and Z + X. 
. $P\{X + Y - Z = k\} = 1/3$ for k = 0, 2, 4; 


$P\{\sqrt{(X^2 + Y^2)Z} = x\} = 1/3$ for $x = \sqrt{13}, \sqrt{15}, \sqrt{20}$; 
$P\{Z/|X - Y| = 3\} = 1/3$, $P\{Z/|X - Y| = 1\} = 2/3$. 


. Let $P(\omega_j) = 1/10$ for j = 1, 2; $= 1/5$ for j = 3, 4; $= 2/5$ for j = 5; 
$X(\omega_j) = j$ for $1 \le j \le 5$; $Y(\omega_j) = \sqrt{3}$ for j = 1, 4; $= \pi$ for j = 2, 5; 
$= \sqrt{2}$ for j = 3. 


. Let $P(\omega_j) = p_j$, $X(\omega_j) = v_j$, $1 \le j \le n$. 

. $\{X + Y = 7\} = \{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)\}$. 

. $P\{Y = 14000 + 4n\} = 1/5000$ for $1 \le n \le 5000$; E(Y) = 24002. 
. $P\{Y = 11000 + 3n\} = 1/10000$ for $1 \le n \le 1000$; 


$P\{Y = 10000 + 4n\} = 1/10000$ for $1001 \le n \le 10000$; 
E(Y) = 41052.05. 


. E(Y) = 29000 + 7000.e~?”. 
. Aer, x > 0. 


, ov (x?), x >-0; 


- for Va < x < Vb. 


5 
vx) +f(-V}, x > 0. 
7 SVD + /(-V9)}, x > 0 
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15. c = 1/1 — q”). 

"MM 
16. POY =f)=[n+J Fn for —n < j <n Such that n + / 1s even. 

2 


E(Y) = 0. 
no nacm~(") (82 )/(Bhesss 


18. If there are r rotten apples in a bushel of n apples and k are picked at 
random, the expected number of rotten ones among those picked is equal 
to kr/n. 


19. P(X > m) 


20. 1. 
21. Choose v, = (—1)"2"/n, pa = 1/2". 
23. According to the three hypotheses on p. 97: (1) $\sqrt{3}/2$; (2) 3/4; (3) 2/3. 
24. 1. 

26. Fir) = os falr) = £ for 0 <r < 100; E(R) = 2. 
26. Flt) = Fog Se) = 56 for OS 1S 100; ~ 3 


27. $Y = d \tan \theta$, where d is the distance from the muzzle to the wall and $\theta$ is 
the angle the pistol makes with the horizontal direction. 


PCY < y) = arctan’; E(Y) = +0, 


28. EQ*) = +0. 
29. If at most m tosses are allowed, then his expectation is m cents. 


—1 
31. P(X, Y) = (m, m’)) = (5) for 1<m<m <n; P(X =m) = 


I 
m ECX) = +00, 


(n — m5) 3 P(Y = m') =(m'— n(5) 3 P(Y-X=k= 


n\7!} 
(n— (5) 51SkSn=1. 


32. Joint density of (X, Y) is $f(u, v) = 2$ if $0 < u < v < 1$; $= 0$ otherwise. 
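
The value kr/n in answer 18 of this chapter (rotten apples) can be checked directly from the distribution of the number of rotten apples picked. The sketch below assumes that all $\binom{n}{k}$ selections are equally likely; the function name `expected_rotten` and the numbers used are illustrative only.

```python
from fractions import Fraction
from math import comb

def expected_rotten(n, r, k):
    """Expected number of rotten apples among k picked at random from n apples,
    r of which are rotten, computed as the sum of j * P(exactly j rotten)."""
    total = comb(n, k)
    return sum(
        Fraction(j * comb(r, j) * comb(n - r, k - j), total)
        for j in range(0, min(r, k) + 1)
    )

print(expected_rotten(20, 6, 5), Fraction(5 * 6, 20))
```

The same value also follows without any summation from the addition theorem for expectations, since each picked apple is rotten with probability r/n.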


Chapter 5 


1050 95_ 
° 6145 1095, 
18826 
* 19400 
5/9. 
. (a) 1/2; (b) 1/10. 
. 1/4; 1/4. 
1/4. 


Nn fw LW 
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7. 
8. 280 — a+ gs). 


s© 


27. 


28. 
29. 
30. 


31. 
33. 


34, 


35. 


39. 
40. 


—" 


1/2. 


6/11, 3/11, 2/11. 


. 400/568. 
1/2. 


p+ 5 PX — p). 

. 59/80. 

. P(no umbrella | rain) = 2/9; P(no rain | umbrella) = 5/9. 
. 43/84. 

. 5/13. 

. (a) 3/8; (b) 3/4; (c) 1/3. 


of (£0 () 280) Spm snss 


. The probabilities that the number is equal to 1, 2, 3, 4, 5, 6 are equal 


respectively to: 

(1) pt; (2) pips + pipe; (3) pips + 2pips + pips: 

(4) 2p, pops + ps + 3pipops; (5) 2pop3 + 3p:p3Ps + 3pip3; 

(6) pops + pops + 6pipeps; (7) 3pips + 3p2p3; (8) 3p2p3; (9) p3. Tedious 
work? See Example 20 and Exercise No. 23 of Chapter 8 for general 


method. 
4\" 3\" 2\" 
(5) ~2(6) + (8) 
pz 9?" (5, ps) eyP™ 
2/7. 
$P(\text{maximum} \le y \mid \text{minimum} \le x) = y^2/(2x - x^2)$ if $y \le x$; 
$= (2xy - x^2)/(2x - x^2)$ if $y > x$. 
1/4. 
(a) $(r + c)/(b + r + 2c)$; (b) $(r + 2c)/(b + r + 2c)$; (c), (e), (f): 
$(r + c)/(b + r + c)$; (d): same as (b). 
{bx(b2 + Dn + bird + 1) + nbn — I+ ntre + Dn} / 
(b1 + 1)°(b2 + re + 1). 


(>, ke) / N (= ke). 


$(1 + p)^2/4$; $(1 + pq)/2$. 


      0     1     2 
0  |  q     p     0 
1  |  q/2   1/2   p/2 
2  |  0     q     p 


Chapter 6 


. $.1175; $.5875. 
. $94000; $306000. 
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3 2 4 3 4 
3. (B+totututa) 
4, 21; 35/2. 
5. .5; 2.5. 
aS. BN 782 
13/4: at (3) / (Gs) 
6 25 
7. (6/7); 7 {1 = (5) } 


. (a) $1 - (364/365)^{500} - 500(364)^{499}/(365)^{500}$. 
(b) 500/365. 


(c) $365\{1 - (364/365)^{500}\}$. 


(d) 365p where p is the number in (a). 


9. Expected number of boxes getting k tokens is equal to $m \binom{n}{k} \frac{(m-1)^{n-k}}{m^n}$; 
expected number of tokens alone in a box is equal to $n \left(\frac{m-1}{m}\right)^{n-1}$. 


10. $P(n_j$ tokens in jth box for $1 \le j \le m) = \frac{n!}{n_1! \cdots n_m!} \cdot \frac{1}{m^n}$, 
where $n_1 + \cdots + n_m = n$. 
11. 49. 
12. 7/2. 
13. 100p; $10\sqrt{p(1 - p)}$. 
14. 46/5. 
N)n—1 


15. .(a) N+ 1; (b) = es, 
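16. Let M denote the maximum. With replacement: 


$P(M = k) = \frac{k^n - (k-1)^n}{N^n}$, $1 \le k \le N$; 


$E(M) = \sum_{k=1}^{N} \left\{1 - \left(\frac{k-1}{N}\right)^n\right\}$; 


without replacement: 


$P(M = k) = \binom{k-1}{n-1} \Big/ \binom{N}{n}$, $n \le k \le N$; 


$E(M) = \frac{n(N + 1)}{n + 1}$. 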
17. (a) $nr/(b + r)$; (b) $(r^2 + br + cnr)/(b + r)$. 
19. 1/p. 
22. E(X) = I/n. 


23. AT) = 2+ +—4; our) = $+ HS - (f+ +—#)" 
be 
24. K(T|T > n) = "Ua. 


25. (a) $1/(5\lambda)$; (b) $137/(60\lambda)$. 
26. .4%. 
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27. 
28. 


29. 


30. 


31. 
32. 


33. 


34. 
sm 


36. 
37. 


38. 


Rw bw 


$E(aX + b) = aE(X) + b$, $\sigma^2(aX + b) = a^2\sigma^2(X)$. 

Probability that he quits winning is 127/128, having won $1; probability 
that he quits because he does not have enough to double his last bet is 
1/128, having lost $127. Expectation is zero. So is it worth it? In the 
second case he has probability 1/256 of losing $150, same probability of 
losing $104, and probability 127/128 of winning $1. Expectation is still 
zero. 

E(maximum) = n/(n + 1); E(minimum) = 1/(n + 1); 

E(range) = (n − 1)/(n + 1). 


9(2) ~ pit (q; + DZ); g’(1) _ > Dj. 


$u_k = P\{S_n \le k\}$; $g(z) = (q + pz)^n/(1 - z)$. 
$g(z) = (1 - z^{2N+1})/\{(2N + 1)z^N(1 - z)\}$; $g'(1) = 0$. 


() i l 
4n 
j N-1 ]| 
g(z) = 2% Ty J e(I)=N>D TF 
— jz 7=0 J 


= 8 ): ry = = gl) + 8’); ms = 3") + 3g") + 8’); 
= gO(1) + 6g") + 7g") + 8). 


a ILO) 


(a) $(1 - e^{-c\lambda})/(c\lambda)$, $\lambda > 0$; (b) $2(1 - e^{-c\lambda} - c\lambda e^{-c\lambda})/(c^2\lambda^2)$, $\lambda > 0$; 
(c) A+ I. 
Laplace transform of $S_n$ is equal to $\mu^n/(\lambda + \mu)^n$; 


$P(a < S_n < b) = \frac{\mu^n}{(n-1)!} \int_a^b u^{n-1} e^{-\mu u}\,du$. 
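
The formulas in answer 16 of this chapter for the maximum M of a sample of size n from the numbers 1, ..., N, namely $P(M = k) = (k^n - (k-1)^n)/N^n$ with replacement and $P(M = k) = \binom{k-1}{n-1}/\binom{N}{n}$ with $E(M) = n(N+1)/(n+1)$ without replacement, can be checked by brute-force enumeration. The sketch below is illustrative only: the values N = 6, n = 3 and the helper names are assumptions chosen to keep the enumeration small.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def max_distribution(draws):
    """Distribution of the maximum over a list of equally likely draws."""
    counts = {}
    for draw in draws:
        m = max(draw)
        counts[m] = counts.get(m, 0) + 1
    return {k: Fraction(c, len(draws)) for k, c in counts.items()}

def expectation(dist):
    """Expectation of a finite distribution given as {value: probability}."""
    return sum(k * p for k, p in dist.items())

N, n = 6, 3  # small illustrative values, chosen only for the check
tickets = range(1, N + 1)
with_repl = max_distribution(list(product(tickets, repeat=n)))
without_repl = max_distribution(list(combinations(tickets, n)))

# Closed forms quoted in the lead-in.
closed_with = {k: Fraction(k**n - (k - 1)**n, N**n) for k in range(1, N + 1)}
closed_without = {k: Fraction(comb(k - 1, n - 1), comb(N, n)) for k in range(n, N + 1)}

print(with_repl == closed_with, without_repl == closed_without)
print(expectation(without_repl), Fraction(n * (N + 1), n + 1))
```

Sampling with replacement is enumerated over ordered draws and sampling without replacement over unordered ones; in each case all enumerated draws are equally likely, which is what the counting argument requires.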


Chapter 7 


_ 3 o-2/3 
| ze 3. 


4 °° —Il1 


. $e^{-\alpha}\alpha^k/k!$ where $\alpha = 1000/324$. 


e210 Se (20) 
k=20 kK! 
. Let $\alpha_1 = 4/3$, $\alpha_2 = 2$. $P\{X_1 = j \mid X_1 + X_2 = 2\} = \frac{2!}{j!(2 - j)!} \cdot \frac{\alpha_1^j \alpha_2^{2-j}}{(\alpha_1 + \alpha_2)^2}$ 
for j = 0, 1, 2. 


. If (n + 1)p is not an integer, the maximum term of $B_k(n; p)$ occurs at 
k = [(n + 1)p] where [x] denotes the greatest integer not exceeding x; 
if (n + 1)p is an integer, there are two equal maximum terms for k = 
(n + 1)p − 1 and (n + 1)p. 




7. If $\alpha$ is not an integer, the maximum term of $\pi_k(\alpha)$ occurs at k = [$\alpha$]; 
if $\alpha$ is an integer, at k = $\alpha$ − 1 and k = $\alpha$. 
8. exp [—Ac + a(e” — 1)]. 
9. $\pi_k(\alpha + \beta)$. 
11. e-*? 3 oo. 
k= 50 


12. (n ef ur—le—“/2 du. 


nas) *-a) 


Vn 
14. Find n such that 26 (a ‘") — 1 > .95. We may suppose p > 1/2 (for 


the tack I used.) 
15. 537. 
16. 475. 


—_— 2/2. 
24. V5. Tnx e 


27. P{6(t) > us = e~™; P{i(t) > u} = e™ foru<t; = Oforu at. 
28. No! 
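
The rule for the maximum term of the binomial distribution given earlier in this chapter's answers (a single maximum at k = [(n + 1)p] when (n + 1)p is not an integer, two equal maxima at (n + 1)p − 1 and (n + 1)p when it is) can be illustrated in exact arithmetic. The helper names and the particular values of n and p below are assumptions made for the example.

```python
from fractions import Fraction
from math import comb

def binomial_terms(n, p):
    """Exact binomial probabilities C(n, k) p^k (1 - p)^(n - k) for k = 0..n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def modes(terms):
    """All indices k at which the largest term is attained."""
    top = max(terms)
    return [k for k, t in enumerate(terms) if t == top]

p = Fraction(1, 3)
print(modes(binomial_terms(10, p)))  # (n + 1)p = 11/3 is not an integer
print(modes(binomial_terms(11, p)))  # (n + 1)p = 4 is an integer
```

Exact fractions are used so that the tie between the two maximum terms in the integer case is detected without rounding error.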


Chapter 8 


1. (a) p* + 3p%q + 2p’q?; (b) p* + 2p*q + 2pq* + q; 
(c) 1 — (p* + p’q); (d) 1 — (p* + p’g + pq’ + @’). 

2. For Y,: / = the set of even integers; Po. ..42 = P?, Pon = 2p9q, 
Prin. = q’; 
For Z,: J = the set of odd integers; p2,-1,0.41 = D?, P—i.—1 = 2D9, 
Prins = G3 P{Zo = If = p, P{Zo = —l} = 4. 

3. For X,: / = the set of nonnegative integers; p,.. = 4, Piri = PD. 
For Y,: / = the set of all integers, p,..1 = 4, Pina = DP. 

4. P{|Yonsil = 2+ Al [Yon] = 21} = (pet + GP Y/C™ +g), 
P{| Yong] = 26 — 1] | Yon| = 27} = (pq + pq??)/(p?* + 9’"). 


5. 1 2 3 2 3 
1 1 1 
26 9 0 age 
Iil ; 
> 4 8 Pig: Piqi 
112 2 
4 3.9 MP2 pe 
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| 
6. |" l ° pee = (1 — a)""a; px? = 1 — (1 — @)*37m = 


a 


9. 7= {(0;2,0<i<n-} 
l l ; 
Pa, a = 5, P20 = 5 forOSisn— I. 


10. 1/13. 
l. H D 1 

| wo = 5, wn = 5, Wo = § 
U;}0o 1 0 
Hip 0 4q 
D|0 1 O 

12. Same as given in (8.1.9) and (8.1.10). 

13. e =r’) q 


J 
= 77>? 7 2 where r = =- 
7 (p-gl-r) p- Pp 
15. Pw = (i/N)’, Pin = 2i(N — i)/N?, Pir = ((N _ i)/N); 


N\?2 2N 
= ° ] < 
m= GY / (yO SiS 
16. $w_1 = (1 - \beta)/(2 - \alpha - \beta)$; $w_2 = (1 - \alpha)/(2 - \alpha - \beta)$. 
18. Dyin = PPro = 1—prw, = p'g,0sjs~. 
c~1 
(19. Let r = p/q, AP =1+p? Dd r*+ re; then wo = A; we = pu'r*A, 
k= 
l<k<ce-—1lj;w= raA. 
20. fi = .721; f31 = .628. 
* _ a * __ Zs . 
21. fiaw = sap Sio l ay? OSES 2N. 
22. (1 —a,+ Vai — 2a, + 1 — 4anar)/2ar. - 
23. Coefficient of z? in g(h(z)) where A(z) = > b,2?. 
7==0 


24. Let $e_j$ denote the expected number of further impulses received until the 
meter registers the value $l$, when the meter reads $j$; then $e_j = l$ for 
$1 \le j \le l - 1$, $e_l = 0$. 


25. Cm = > I 
g=1 J 
26.   (123) (132) (213) (231) (312) (321) 
(123) 0 0 q Pp 0 0 
(132) 0 0 0 0 Pp 
(213) q Pp 0 0 0 0 
(231) 0 0 0 0 Pp q 
(312) p q 0 0 0 0 
(321) 0 0 p q 0 0 


27, 1 2 3 
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32. pj = 1/G4+ 1) fors—i<j< s; = 0 otherwise; 
ws = AP + DMs + Is + 2) for0<j<s; x jw; = 28/3. 
j= 


33. Py = (, f ,) pU — p) + for s —i<j < 5s; = 0 otherwise; 


_fs 1 \if p \*3 ; (ee 
= (4) (G35) (FR) tro sss Em = 9 +p. 
44. P{S = k| X, = i} = pil — pt), k> 1. 
45. Py = pu/Q —_ Pu) for i FI;pu= 0. 
46. P{(Xn, Xnus) = (k, 2k — f+ 1) | Xi Xx) = i, O} = p, 
P{(X,, Xn41) = (k, 2k —j) | (X,-1, Xn) = (Cj, k)} = q. 


Let HY? = 3 H?, then {H} is a Markov chain of order 3, etc. 
v=1 
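
A stationary distribution such as the one in answer 16 of this chapter can be verified by checking the steady-state equation wP = w. The sketch below rests on an assumption that is not confirmed by the answer list itself: that the two-state chain has transition matrix with rows (α, 1 − α) and (1 − β, β). Under that assumed parametrization the pair $((1-\beta)/(2-\alpha-\beta),\ (1-\alpha)/(2-\alpha-\beta))$ is stationary; the function name and the numerical values are illustrative.

```python
from fractions import Fraction

def stationary_two_state(alpha, beta):
    """Candidate stationary distribution for the ASSUMED matrix
    [[alpha, 1 - alpha], [1 - beta, beta]]."""
    d = 2 - alpha - beta
    return ((1 - beta) / d, (1 - alpha) / d)

alpha, beta = Fraction(1, 3), Fraction(3, 4)   # illustrative values only
P = [[alpha, 1 - alpha], [1 - beta, beta]]     # assumed parametrization
w = stationary_two_state(alpha, beta)

# One step of the chain applied to w: (wP)_j = sum_i w_i * P[i][j].
wP = tuple(sum(w[i] * P[i][j] for i in range(2)) for j in range(2))
print(w, wP)
```

If the problem uses a different labelling of the two parameters, the same check applies after rewriting the matrix `P` accordingly.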


Table 1 


Values of the standard normal distribution function 


$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\,du = P(X \le x)$ 


  x     0     1     2     3     4     5     6     7     8     9 
−3.   .0013 .0010 .0007 .0005 .0003 .0002 .0002 .0001 .0001 .0000 
−2.9  .0019 .0018 .0017 .0017 .0016 .0016 .0015 .0015 .0014 .0014 
−2.8  .0026 .0025 .0024 .0023 .0023 .0022 .0021 .0020 .0020 .0019 
−2.7  .0035 .0034 .0033 .0032 .0031 .0030 .0029 .0028 .0027 .0026 
−2.6  .0047 .0045 .0044 .0043 .0041 .0040 .0039 .0038 .0037 .0036 
−2.5  .0062 .0060 .0059 .0057 .0055 .0054 .0052 .0051 .0049 .0048 
−2.4  .0082 .0080 .0078 .0075 .0073 .0071 .0069 .0068 .0066 .0064 
−2.3  .0107 .0104 .0102 .0099 .0096 .0094 .0091 .0089 .0087 .0084 
−2.2  .0139 .0136 .0132 .0129 .0126 .0122 .0119 .0116 .0113 .0110 
−2.1  .0179 .0174 .0170 .0166 .0162 .0158 .0154 .0150 .0146 .0143 
−2.0  .0228 .0222 .0217 .0212 .0207 .0202 .0197 .0192 .0188 .0183 
−1.9  .0287 .0281 .0274 .0268 .0262 .0256 .0250 .0244 .0238 .0233 
−1.8  .0359 .0352 .0344 .0336 .0329 .0322 .0314 .0307 .0300 .0294 
−1.7  .0446 .0436 .0427 .0418 .0409 .0401 .0392 .0384 .0375 .0367 
−1.6  .0548 .0537 .0526 .0516 .0505 .0495 .0485 .0475 .0465 .0455 
−1.5  .0668 .0655 .0643 .0630 .0618 .0606 .0594 .0582 .0570 .0559 
−1.4  .0808 .0793 .0778 .0764 .0749 .0735 .0722 .0708 .0694 .0681 
−1.3  .0968 .0951 .0934 .0918 .0901 .0885 .0869 .0853 .0838 .0823 
−1.2  .1151 .1131 .1112 .1093 .1075 .1056 .1038 .1020 .1003 .0985 
−1.1  .1357 .1335 .1314 .1292 .1271 .1251 .1230 .1210 .1190 .1170 
−1.0  .1587 .1562 .1539 .1515 .1492 .1469 .1446 .1423 .1401 .1379 
−.9   .1841 .1814 .1788 .1762 .1736 .1711 .1685 .1660 .1635 .1611 
−.8   .2119 .2090 .2061 .2033 .2005 .1977 .1949 .1922 .1894 .1867 
−.7   .2420 .2389 .2358 .2327 .2297 .2266 .2236 .2206 .2177 .2148 
−.6   .2743 .2709 .2676 .2643 .2611 .2578 .2546 .2514 .2483 .2451 
−.5   .3085 .3050 .3015 .2981 .2946 .2912 .2877 .2843 .2810 .2776 
−.4   .3446 .3409 .3372 .3336 .3300 .3264 .3228 .3192 .3156 .3121 
−.3   .3821 .3783 .3745 .3707 .3669 .3632 .3594 .3557 .3520 .3483 
−.2   .4207 .4168 .4129 .4090 .4052 .4013 .3974 .3936 .3897 .3859 
−.1   .4602 .4562 .4522 .4483 .4443 .4404 .4364 .4325 .4286 .4247 
−.0   .5000 .4960 .4920 .4880 .4840 .4801 .4761 .4721 .4681 .4641 


Reprinted with permission of The Macmillan Company from INTRODUCTION TO 
PROBABILITY AND STATISTICS, second edition, by B. W. Lindgren and G. W. 
McElrath. Copyright © 1966 by B. W. Lindgren and G. W. McElrath. 


2.2   .9861 .9864 .9868 .9871 .9874 .9878 .9881 .9884 .9887 .9890 


INDEX 


A priori, a posteriori probability, 118 
Absorbing state, 260 
Absorption probability, 288 
Absorption time, 245 
Allocation models, 52 
Almost surely, 233 
Aperiodic class, 285 

Area, 19, 41 

Arithmetical density, 36 
Artin, 169 

Asymptotically equal, 211 
Axioms for probability, 24 


Banach’s match problem, 70, 311 
Bayes’ theorem, 118 
Bernoulli, J., 228 
Bernoulli’s formula, 36 
Bernoullian random variable, 90, 169, 
180 
Bertrand’s paradox, 97 
Binomial coefficient, 49 
generalized, 131 
properties, 55—58, 190 
Binomial distribution, 90 
Birth-and-death process, 282 
Birthday problem, 63 
Boole’s inequality, 29 
Borel, 95 
Borel field, 109 
Borel’s theorem, 232 
Branching process, 292 
Brownian motion, 249 
Buffon’s needle problem, 155 


Card shuffling, 300 
Cardano’s paradox, 166 
Cauchy functional equation, 155 
Cauchy-Schwarz inequality, 169 
Central limit theorem, 222 
Chapman-Kolmogorov equations, 255 
Characteristic function, 184 
Chebyshev’s inequality, 228, 235 
Chi-square distribution, 235 
Chinese dice game, 69 
Class of states, 260 
period of, 284 
Class property, 266 
Coin-tossing, 35 
Communicating states, 260 
Complement, 3 
Conditional expectation, 124 
Conditional probability, 111 
basic formulas, 116-118 


Convergence of distributions, 222 
Convolution, 179, 190 
Coordinate variable, 74 
Coupon-collecting problem, 159 
Countable set, 23 

Countable additivity, 30 
Correlation, 169 

Covariance, 169 

Cramér, 223 

Credibility of testimony, 152 


D’Alembert’s argument, 25, 51 
De Méré’s dice problem, 138 

De Moivre-Laplace theorem, 215 
De Morgan’s laws, 6 

Density function, 91 

Dice patterns, 69 

Difference, 8 

Difference equation, 243 
Discrete, 92 

Disjoint, 9 

Distribution function, 82, 92, 101 
Doob, 306 

Doubly stochastic matrix, 281 
Doubling the bet, 188 

Duration of play, 275 


Ehrenfest model, 258, 283, 298 

Einstein, 123, 249 

Elementary probabilities, 82 

Empty set, 2 

Equally likely, 24, 33 

Ergodic theorem, 233 

Errors in measurements, 166, 218, 225 

Event, 31 

Exchangeable events, 134 

Expectation, 83, 110, 156 
addition theorem, 157, 161 
approximation of, 94 
of function of random variable, 85, 93 
multiplication theorem, 164 
expression by tail probabilities, 186, 

187 


Expected return time, 275 
Exponential distribution, 98 
memoryless property, 114, 155 


Factorial, 48 

Favorable to, 141 

Feller, v, 69, 107, 233, 307 
Fermat-Pascal correspondence, 27, 138 
Finite additivity, 29 
First entrance time, 262 
decomposition formula, 264 

Fourier transform, 184 

Frequency, 20, 233 

Fundamental rule (of counting), 44 


Gambler’s ruin problem, 242, 246 
Gamma distribution, 189 


Gauss-Laplace distribution, 217 
Generating function, 177 
as expectation, 182 
multiplication theorem, 180, 183 
of binomial, 180; geometric, 181; 
negative binomial, 181; Poisson, 
193; transition probabilities, 265 
Genetical models, 144, 291, 299 
Genotype, 145 
Geometrical distribution, 88 
Geometrical probability problems, 96, 
135 


Hardy-Weinberg theorem, 147 
Hereditary problem, 148 
Holding time, 200 
Homogeneity 
in time, 203, 252 
in space, 258 
Homogeneous chaos, 197 
Homogeneous Markov chain, 253 (see “Markov chain”) 


Identically distributed, 222 

Independent events, 34, 136 

Independent random variables, 135, 137 

Indicator, 13, 163 

Infinitely often, 247, 267 

Initial distribution, 253 

Integer-valued random variable, 86 

Intensity of flow, 200 

Inter-arrival time, 162, 200 (see also “waiting time”) 

Intersection, 4 


Joint density, distribution function, 103 
Joint probability formula, 117 
Joint probability distribution, 101 


Keynes, 114, 122, 128, 307 
Khintchine, 228 
Kolmogorov, 138, 286 


Laplace, 119 (see also under “De Moivre” and “Gauss”) 
law of succession, 123 
Laplace transform, 184 


of waiting times, 201 




Last exit time, 263 
decomposition formula, 264 
Law of large numbers, 227 
J. Bernoulli’s, 228 
strong, 232 
Law of small numbers, 227 
Leading to, 260 
Lévy, 223, 249 
Lottery problem, 158 


Marginal density, 103 
Marginal distribution, 101 
Markov, 228, 252 
Markov chain, 253 
examples, 256-260 
non-homogeneous, 253, 259 
of higher order, 304 
positive-, null-recurrent, 282 
recurrent, nonrecurrent, 271 
reverse, 304 
two-state, 280 
Markov property, 252 
strong, 270 
Martingale, 305 
Mathematical expectation (see “expectation”) 
Matching problems, 64, 163, 170, 197 
Maximum and minimum, 140 
Measurable, 24, 109 
Median, 107 
Moments, 167 
Montmort, 191, 242 
Multinomial coefficient, 50 
Multinomial distribution, 173 
Multinomial theorem, 171 


Negative binomial distribution, 181 

Neyman-Pearson theory, 152 

Non-homogeneous Markov chain, 253, 
259 

Non-Markovian process, 260 

Nonmeasurable, 38 

Nonrecurrent, 266 (see under “recurrent”) 

Normal distribution, 217 
convergence theorem, 223 
moment-generating function, moments, 219 
positive, 235 
table of values, 320-321 
Normal family, 220 
Null-recurrent, 282 


Occupancy problems, 185 (see also “allocation models”) 

Occupation time, 275 

Optional time, 269 

Ordered k-tuples, 44 




Pairwise independence, 142 
Partition problems, 53 
Pascal’s letters to Fermat, 27, 138 
Pascal’s triangle, 55 
Permutation formulas, 48-49 
Persistent (see “recurrent”) 
Poincaré’s formula, 162 
Poisson, 128 
Poisson distribution, 192, 204 
properties, 206-208 
models for, 197-199 
Poisson limit law, 194 
Poisson process, 204 
distribution of jumps, 209 
finer properties, 236 
Poisson’s theorem on sequential sampling, 128 
Poker hands, 69 
Pólya, 129, 223, 259 
Positive-recurrent, 282 
Probability (classical definition), 24 
Probability distribution, 82 
Probability measure, 23 
construction of, 31 
Probability of absorption, 288 
Probability of extinction, 294 
Probability triple, 109 
Problem (for other listings see under 
key words) 
of liars, 152 
of points, 27, 190 
of rencontre, 163 
of sex, 115 


Quality control, 59 
Queuing process, 301-302 


Random mating, 145 
Random variable, 75, 109 
countable vs. density case, 93 
discrete, continuous, 92 
function of, 71 
range of, 81 
with density, 92 
Random vector, 73, 101 
Random walk, 240 
free, 256 
generalized, 257-258 
in higher dimensions, 259, 272-273 
on a circle, 281 
recurrence of, 247, 271-273 
with barriers, 257 
Randomized sampling, 208 
Recurrent, 266, 268 
Markov chain, 271 
random walk, 247 
Renewal process, 299 
Repeated trials, 33 
Riemann sums, 94 


Sample point, space, 2 
Sample function, 205 
Sampling (with or without replacement) 
vs. allocating, 53 
with ordering, 46 
without ordering, 48-50 
Sequential sampling, 125 
Significance level, 226 
Simpson’s paradox, 143 
Size of set, 2 
St. Petersburg paradox, 107 
Standard deviation, 167 
State space, 252 
Stationary distribution, 279 
Stationary transition probabilities, 253 
Stationary process, 134, 148, 279 
Steady state, 274 
equation for, 277 
Stirling’s formula, 211, 237 
Stochastic matrix, 255 
Stochastic process, 125, 205 
Stochastically closed, 286 
Stochastic independence (see “independence”) 
Stopping time, 269 
Strong law of large numbers, 232 
Strong Markov property, 270 
Summable, 156 
Symmetric difference, 8 


Taboo probability, 264, 303 

Tauberian theorem, 276 

Time parameter, 125 

Tips for counting problems, 59 

Total probability formula, 118 

Transient (see “nonrecurrent”) 

Transition matrix, 255 

Transition probability, 255 
limit theorems for, 275, 286 


Uniform distribution, 87, 96 
Union, 3 


Variance, 167 
addition theorem, 168 


Waiting time, 88, 97, 181, 200, 259 
Wald’s equation, 88 
Wiener process, 249 


Zero-or-one law, 295 

Reviewed by Lawrence E. Jerome. This is an excellent mathematics text, 
both in approach and in readability; the writing style is light, almost breezy 
at times. There are many examples and problems with solutions, interspersed 
with interesting historical aspects and ideas. I would have appreciated a fuller 
treatment of the historical literature, which is so difficult to track down in the 
field of mathematics. The advantage of Chung's approach is that he uses more 
powerful mathematical techniques (set theory, distribution functions, 
transforms, etc.) to operate in the trial space as opposed to the traditional 
sample space. Trial space, as the term implies, refers to representations of the 
actual experimental space; sample space is the combinatorial representation 
of all possible outcomes and is therefore much larger and more cumbersome. 
The key is Chung's use of indicator functions, which make it possible to 
operate in the trial space because 'the expectation of an indicator random 
variable is just the probability of the corresponding event: E(I_A) = P(A)' 
(p. 163). Curiously, Chung does not make this advantage of his approach very 
clear. He does not show how much easier it is to operate in the trial space 
instead of the sample space. For example, his derivation of the binomial 
formula for... 

*10295 Menhart Lane, Cupertino, CA 95014, USA. 


